Message from the USENIX Security ’11 Program Chair 


It is my pleasure to welcome you to the 20th USENIX Security Symposium, where we have an outstanding pro- 
gram of papers, talks, and other events. 


The conference received 206 submissions. Two of the submissions were withdrawn, and the remaining 204 were 
reviewed by the program committee. The authors of each paper were not revealed to reviewers. The committee 
used a multi-round reviewing process. Every paper was reviewed by at least two reviewers; papers that received 
a positive recommendation in the first round were reviewed by a third and, usually, a fourth reviewer, and many 
papers received five or more reviews. The committee, assisted by many external reviewers, produced a total of 682 
reviews. Reviewers then discussed these papers electronically, producing 826 comments in all. Finally, the program 
committee met during a two-day in-person meeting in Berkeley, California, to discuss the 67 top papers. Niels 
Provos generously served as alternate program chair for ten submissions where I had a conflict of interest, and 
Tadayoshi Kohno handled two more such submissions. 


After careful deliberation, the program committee selected 35 papers for presentation—a record high for USENIX 
Security. The quality of the papers is impressive, a tribute to the high quality of research being produced in our 
field. 


I would like to thank everyone who contributed to the success of USENIX Security ’11. I am particularly grateful 
to the program committee for their hard work, enthusiasm, and conscientious efforts to ensure that each paper re- 
ceived a thorough and fair review. Thanks also to the external reviewers, listed on p. ii, for contributing their time 
and expertise. It has been an honor to work with such a dedicated and thoughtful group. The program committee 
members devoted countless hours to their work; I encourage you to thank them for their service to the community. 


Beyond the refereed papers track, we also have a strong lineup of invited talks, posters, and other events. Sandy 
Clark, Dan Geer, Dan Wallach, and Ellie Young served on the invited talks committee, and they have done an 
excellent job of assembling a slate of interesting invited talks. Patrick Traynor is the chair of this year’s Poster 
Session, and Matt Blaze is chairing the Rump Session. Dan Klein is organizing the training program. Thanks to 
Sandy, Dan, Dan, Ellie, Patrick, Matt, and Dan for their important contributions to what promises to be an interest- 
ing and fun USENIX Security program. 


I would also like to take this opportunity to thank the USENIX organization for their phenomenal support. I am 
especially grateful to Ellie Young, Anne Dickison, Casey Henderson, Jessica Horst, Jane-Ellen Long, Jennifer 
Peterson, Tony Del Porto, board liaison Matt Blaze, and the rest of the USENIX crew. Working with USENIX is 
a true joy. Their dedication to the task of running the conference is inspiring. Please join me in thanking them for 
making the conference such a success. 


Finally, I would like to thank all the authors who submitted papers to USENIX Security ’11 for submitting their 
best research. 


Welcome to San Francisco, California, and the 20th USENIX Security Symposium. I hope you enjoy the confer- 
ence. 


David Wagner, University of California, Berkeley 
USENIX Security ’11 Program Chair 
Abstract 


Web applications often use special string-manipulating 
sanitizers on untrusted user data, but it is difficult to rea- 
son manually about the behavior of these functions, lead- 
ing to errors. For example, the Internet Explorer cross- 
site scripting filter turned out to transform some web 
pages without JavaScript into web pages with valid Java- 
Script, enabling attacks. In other cases, sanitizers may 
fail to commute, rendering one order of application safe 
and the other dangerous. 

BEK is a language and system for writing sanitiz- 
ers that enables precise analysis of sanitizer behavior, 
including checking idempotence, commutativity, and 
equivalence. For example, BEK can determine if a tar- 
get string, such as an entry on the XSS Cheat Sheet, is 
a valid output of a sanitizer. If so, our analysis synthe- 
sizes an input string that yields that target. Our language 
is expressive enough to capture real web sanitizers used 
in ASP.NET, the Internet Explorer XSS Filter, and the 
Google AutoEscape framework, which we demonstrate 
by porting these sanitizers to BEK. 

Our analyses use a novel symbolic finite automata 
representation to leverage fast satisfiability modulo theories 
(SMT) solvers and are quick in practice, taking 
fewer than two seconds to check the commutativity 
of the entire set of Internet Explorer XSS filters, 
between 36 and 39 seconds to check implementations 
of HTMLEncode against target strings from the XSS 
Cheat Sheet, and less than ten seconds to check equiv- 
alence between all pairs of a set of implementations of 
HTMLEncode. Programs written in BEK can be compiled 
to traditional languages such as JavaScript and C#, mak- 
ing it possible for web developers to write sanitizers sup- 
ported by deep analysis, yet deploy the analyzed code 
directly to real applications. 


1 Introduction 


Cross site scripting (“XSS”) attacks are a plague in today’s 
web applications. These attacks happen because 
the applications take data from untrusted users, and then 
echo this data to other users of the application. Because 
web pages mix markup and JavaScript, this data may 
be interpreted as code by a browser, leading to arbitrary 
code execution with the privileges of the victim. The first 
line of defense against XSS is the practice of sanitization, 
where untrusted data is passed through a sanitizer, 
a function that escapes or removes potentially dangerous 
strings. Multiple widely used Web frameworks offer 
sanitizer functions in libraries, and developers often add 
additional custom sanitizers due to performance 
or functionality constraints. 


* Authors are listed alphabetically. Work done while P. Hooimeijer 
and P. Saxena were visiting Microsoft Research. 

Unfortunately, implementing sanitizers correctly is 
surprisingly difficult. Anecdotally, in dozens of code re- 
views performed across various industries, just about any 
custom-written sanitizer was flawed with respect to secu- 
rity [38]. The recent SANER work, for example, showed 
flaws in custom-written sanitizers used by ten web ap- 
plications [9]. For another example, several groups of 
researchers have found specially crafted pages that do 
not initially have cross site scripting attacks, but when 
passed through anti-cross-site scripting filters yield web 
pages that cause JavaScript execution [10, 22]. 

The problem becomes even more complicated when 
considering that a web application may compose multi- 
ple sanitizers in the course of creating a web page. In 
a recent empirical analysis, we found that a large web 
application often applied the same sanitizers twice, de- 
spite these sanitizers not being idempotent. This analysis 
also found that the order of applying different sanitizers 
could vary, which is safe only if the sanitizers are com- 
mutative [32], providing further evidence suggesting that 
developers have a difficult time writing correct sanitiza- 
tion functions without assistance. 

Despite this, much work in the space of detecting and 
preventing XSS attacks [19, 23, 25, 27, 39] has optimisti- 
cally assumed that sanitizers are in fact both known and 
correct. Some recent work has started exploring the is- 
sue of specification completeness [24] as well as san- 
itizer correctness by explicitly statically modeling sets 
of values that strings can take at runtime [13, 26, 36, 37]. 
These approaches use analysis-specific models of strings 
that are based on finite automata or context-free gram- 
mars. More recently, there has been significant interest 
in constraint solving tools that model strings [11, 17, 18, 
20, 31, 34, 35]. String constraint solvers allow any client 
analysis to express constraints (e.g., path predicates for a 
single code path) that include common 
string manipulation functions. 

Sanitizers are typically a small amount of code, per- 
haps tens of lines. Furthermore, application developers 
know when they are writing a new, custom sanitizer or set 
of sanitizers. Our key proposition is that if we are will- 
ing to spend a little more time on this sanitizer code, we 
can obtain fast and precise analyses of sanitizer behavior, 
along with actual sanitizer code ready to be integrated 
into both server- and client-side applications. Our ap- 
proach is BEK, a language for modeling string transfor- 
mations. The language is designed to be (a) sufficiently 
expressive to model real-world code, and (b) sufficiently 
restricted to allow fast, precise analysis, without needing 
to approximate the behavior of the code. 

Key to our analysis is a compilation from BEK pro- 
grams to symbolic finite state transducers, an extension 
of standard finite transducers. Recall that a finite trans- 
ducer is a generalization of deterministic finite automata 
that allows transitions from one state to another to be an- 
notated with outputs: if the input character matches the 
transition, the automaton outputs a specified sequence of 
characters. In a symbolic finite transducer, transitions 
are annotated with logical formulas instead of specific 
characters, and the transducer takes the transition on any 
input character that satisfies the formula. We apply algo- 
rithms that determine if two BEK programs are equiva- 
lent. We also can check if a BEK program can output a 
specific string, and if so, synthesize an input 
yielding that string. 

Our symbolic finite state transducer representation 
enables leveraging satisfiability modulo theories (SMT) 
solvers, tools that take a formula and attempt to find in- 
puts satisfying the formula. These solvers have become 
robust in the last several years and are used to solve com- 
plicated formulas in a variety of contexts. At the same 
time, our representation allows leveraging automata the- 
oretic methods to reason about strings of unbounded 
length, which is not possible via direct encoding to SMT 
formulas. SMT solvers allow working with formulas 
from any theory supported by the solver, while other 
previous approaches using binary decision diagrams are 
specialized to specific types of inputs. 

After analysis, programs written in BEK can be com- 
piled back to traditional languages such as JavaScript or 
C#. This ensures that the code analyzed and tested is 
functionally equivalent to the code which is actually de- 
ployed for sanitization, up to bugs in our compilation. 

This paper contains a number of experimental case 
studies. We conclusively demonstrate that BEK is ex- 
pressive enough for a wide variety of real-life code by 
converting multiple real world Web sanitization func- 
tions from widely used frameworks, including those used 
in Internet Explorer 8’s cross-site scripting filter, to BEK 
programs. We report on which features of the BEK language 
are needed and which features could be added 
given our experience. We also examine other code, 
such as sanitizers from Google AutoEscape and func- 
tions from WebKit, to determine whether or not they can 
be expressed as BEK programs. We maintain samples of 
BEK programs online.¹ 

We then use BEK to perform security specific analy- 
ses of these sanitizers. For example, we use BEK to de- 
termine whether there exists an input to a sanitizer that 
yields any member of a publicly available database of 
strings known to result in cross site scripting attacks. Our 
analysis is fast in practice; for example, we take two seconds 
to check the commutativity of the entire set of Internet 
Explorer 8 XSS filters, and less than 39 seconds to 
check an implementation of the HTMLEncode sanitization 
function against target strings from the 
XSS Cheat Sheet [5]. 

To experimentally demonstrate the difficulty of writ- 
ing correct sanitizers, we hired several freelance devel- 
opers to implement HTMLEncode functionality. Using 
BEK, we checked the equivalence of the seven differ- 
ent implementations of HTMLEncode and used BEK to 
find counterexamples: inputs on which these sanitizers 
behave differently. Finally, we performed scalability ex- 
periments to show that in practice the time to perform 
BEK analyses scales near-linearly. 


1.1 Contributions 


The primary contributions of this paper are: 


• Language. We propose a domain-specific language, 
BEK, for string manipulation. We describe a 
syntax-driven translation from BEK expressions to 
symbolic finite state transducers. 


• Algorithms. We provide algorithms for performing 
composition computation and equivalence check- 
ing, which enables checking commutativity, idem- 
potence, and determining if target strings can be 
output by a sanitizer. We show how JavaScript and 
C# code can be generated out of BEK programs, 
streamlining the client- and server-side deployment 
of BEK sanitizers. 


• Evaluation. We show that BEK can encode real-world 
string manipulating code used to sanitize untrusted 
inputs in web applications. We demonstrate 
the expressiveness of BEK by encoding OWASP 
sanitizers, many IE 8 XSS filters, as well as functions 
written by freelance developers hired through 
odesk.com and vworker.com for our experiments 
presented in this paper. We show how the analyses 
supported by our tool can find security-critical 
bugs or check that such bugs do not exist. To 
improve the end-user experience when a bug is 
found, BEK produces a counter-example. We discover 
that only 28.6% of our sanitizers commute, 
~79.1% are idempotent, and that only 8% are reversible. 
We also demonstrate that most hand-written 
HTMLEncode implementations disagree on 
at least some inputs. 


¹ http://code.google.com/p/bek/ 


• A Scalable Implementation. BEK deals with Unicode 
strings without creating a state explosion. Furthermore, 
we show that our algorithms for equivalence 
checking and composition computation are 
very fast in practice, scaling near-linearly with the 
size of the symbolic finite transducer representation. 
The main reason for this is the symbolic representation 
of the transition relation. 


While the focus of this paper is on XSS attacks², our 
language and analyses are more general and apply to 
any string manipulating function. For example, Chen et 
al. check interactions between firewall rules, finding redundant 
and order-dependent rules in routers [40]. Cho 
and Babić [12] check the equivalence between a specification 
and an implementation for 
state machines in SMTP servers. 


2 Overview 


Figure 1 shows an architectural diagram for the BEK system. 
At the center of the picture is the transducer-based 
representation of a BEK program. At the moment, we 
support a BEK language front end, although other front 
ends that convert Java or C# programs into BEK are also 
possible. We provide motivating examples of the BEK 
language in Section 2.1 and discuss the applications of 
BEK to analyzing sanitizers in Section 2.2. 


2.1 Introductory Examples 


Example 1. The following BEK program is a basic san- 
itizer that backslash-escapes single and double quotes 
(but only if they are not escaped already). The iter con- 
struct is a block that uses a character variable c and a 
single boolean state variable b that is initially f (false). 
Each iteration of the block binds the character variable to 
a single character of the string t; iteration continues until 
no more characters remain. The block is broken into 
case statements. If a character satisfies the condition of 
the case statement, the corresponding code is executed. 


² The dual of the issue of code injection is data privacy; BEK is 
equally suitable to analyzing the corresponding data cleansing functions. 
Figure 1: BEK architecture. We use a representation 
based on symbolic finite state transducers (defined in-text) 
to model string sanitization code without approximation. 








private static string EncodeHtml(string t) 
{ 
    if (t == null) { return null; } 
    if (t.Length == 0) { return string.Empty; } 
    StringBuilder builder = 
        new StringBuilder("", t.Length * 2); 
    foreach (char c in t) 
    { 
        if ((((c > '`') && (c < '{')) || 
             ((c > '@') && (c < '['))) || (((c == ' ') || 
             ((c > '/') && (c < ':'))) || (((c == '.') || 
             (c == ',')) || ((c == '-') || (c == '_'))))) { 
            builder.Append(c); 
        } else { 
            builder.Append("&#" + 
                ((int) c).ToString() + ";"); 
        } 
    } 
    return builder.ToString(); 
} 


Figure 2: Code for AntiXSS.EncodeHtml version 2.0. 


Here yield(c) outputs the current character c. 


iter(c in t) {b := f;} { 
  case(¬(b) ∧ (c = ‘'’ ∨ c = ‘"’)) { 
    b := f; yield(‘\’); yield(c); } 
  case(c = ‘\’) { 
    b := ¬(b); yield(c); } 
  case(t) { 
    b := f; yield(c); } 
} 


The boolean variable b is used to track whether the previ- 
ous character seen was an unescaped slash. For example, 
in the input \\" the double quote is not considered es- 
caped, and the transformed output is \\\". If we apply the 
BEK program to \\\" again, the output is the same. An 
interesting question is whether this holds for any output 
string. In other words, we may be interested in whether 
a given BEK program is idempotent. 

If implemented incorrectly, double applications of 
such sanitization functions can result in duplicate escaping. 
This in turn has led to command injection or script-injection 
attacks in the past. Therefore, checking idempotence 
of certain functions is practically useful. We 
will see in the next section how BEK can perform such 
checks. ⊠ 
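To make the behavior of Example 1 concrete, the following is a minimal Python transliteration of the BEK program. It is only an illustrative sketch: the function name `escape_quotes` is an assumption, and the code tests idempotence on a single input rather than proving it for all inputs as the BEK analysis does.

```python
def escape_quotes(t: str) -> str:
    """Backslash-escape quotes that are not already escaped (Example 1)."""
    out = []
    b = False  # True iff the previous character was an unescaped backslash
    for c in t:
        if not b and c in ("'", '"'):
            # case(not b and c is a quote): emit a backslash, then the quote
            b = False
            out.append("\\")
            out.append(c)
        elif c == "\\":
            # case(c is a backslash): toggle the escape state, copy the character
            b = not b
            out.append(c)
        else:
            # case(true): copy the character and reset the state
            b = False
            out.append(c)
    return "".join(out)


once = escape_quotes('\\\\"')   # the three-character input \\"
twice = escape_quotes(once)     # apply the sanitizer a second time
```

On the input `\\"` from the text, `once` is `\\\"` and `twice` is the same string, matching the example.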


Example 2. The code in Figure 2 is from the public 
Microsoft AntiXSS library. The sanitizer iterates over 
the input character-by-character. Depending on the char- 
acter encountered, a different action is taken, such as in- 
cluding the character verbatim or encoding it in some 
manner, such as numeric HTML escaping. 

The BEK program corresponding to EncodeHtml is 


iter(c in t) { 
  case(¬φ(c)) { 
    yield [‘&’, ‘#’] + dec(c) + [‘;’]; } 
  case(true) { 
    yield [c]; }} 


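where dec is a built-in library function that returns the 
decimal representation of the character and $\varphi(c)$ is the 
formula 


$$
(\texttt{'a'} \le c \wedge c \le \texttt{'z'}) \vee (\texttt{'A'} \le c \wedge c \le \texttt{'Z'}) \vee
(\texttt{'0'} \le c \wedge c \le \texttt{'9'}) \vee c = \texttt{' '} \vee c = \texttt{'.'} \vee
c = \texttt{','} \vee c = \texttt{'-'} \vee c = \texttt{'\_'}
$$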


The BEK program iterates over each character of the 
input. If the character satisfies the formula $\varphi(c)$, then the 
program outputs the character. Otherwise the program 
escapes the character by outputting its decimal encod- 
ing, together with the &# prefix and semicolon. Note 
that this sanitizer is not idempotent, because applying the 
function twice to the string &# will result in double escaping. 
Our tool can detect this in under a second. ⊠ 
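The double-escaping claim in Example 2 can be reproduced with a small Python sketch of the same character-wise logic. The names `is_safe` and `encode_html` are assumptions made for illustration. Python's `ord` returns a code point, whereas the C# cast yields a UTF-16 code unit, so the two agree only on characters in the Basic Multilingual Plane.

```python
def is_safe(c: str) -> bool:
    """The whitelist described in the text: letters, digits, space and . , - _"""
    return (
        "a" <= c <= "z"
        or "A" <= c <= "Z"
        or "0" <= c <= "9"
        or c in " .,-_"
    )


def encode_html(t: str) -> str:
    """Copy safe characters; replace every other character by &#<decimal>;"""
    return "".join(c if is_safe(c) else "&#" + str(ord(c)) + ";" for c in t)


once = encode_html("&#")    # each of '&' and '#' becomes a numeric escape
twice = encode_html(once)   # the escapes themselves are escaped again
```

Because `&`, `#`, and `;` are not on the whitelist, the second application rewrites the escapes produced by the first, so `twice` differs from `once`.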


Multiple implementations may exist of the “same” 
sanitizer. For example, Figure 3 shows the result of run- 
ning the Red Gate Reflector .NET decompiler on the Sys- 
tem.NET implementation of EncodeHTML. We have con- 
verted this code to BEK as well, noticing that the goto 
structure is the result of a loop after decompilation. Us- 
ing our analyses, we can check these implementations for 
equivalence. Our implementation can detect in less than 
one second that the System.NET implementation does 
not escape single quote characters, while the AntiXSS 
implementation does, meaning that the two implementa- 
tions are not equivalent. Failure to escape single quotes 
can lead to XSS attacks, so this 
difference is significant [33]. 
public static string EncodeHtml(string s) 
{ 
    if (s == null) 
        return null; 
    int num = IndexOfHtmlEncodingChars(s, 0); 
    if (num == -1) 
        return s; 
    StringBuilder builder = new StringBuilder(s.Length + 5); 
    int length = s.Length; 
    int startIndex = 0; 
Label_002A: 
    if (num > startIndex) { 
        builder.Append(s, startIndex, num - startIndex); 
    } 
    char ch = s[num]; 
    if (ch > '>') { 
        builder.Append("&#"); 
        builder.Append(((int) ch). 
            ToString(NumberFormatInfo.InvariantInfo)); 
        builder.Append(';'); 
    } 
    else { 
        char ch2 = ch; 
        if (ch2 != '"') { 
            switch (ch2) 
            { 
                case '<': 
                    builder.Append("&lt;"); 
                    goto Label_00D5; 

                case '=': 
                    goto Label_00D5; 

                case '>': 
                    builder.Append("&gt;"); 
                    goto Label_00D5; 

                case '&': 
                    builder.Append("&amp;"); 
                    goto Label_00D5; 
            } 
        } 
        else { 
            builder.Append("&quot;"); 
        } 
    } 
Label_00D5: 
    startIndex = num + 1; 
    if (startIndex < length) { 
        num = IndexOfHtmlEncodingChars(s, startIndex); 
        if (num != -1) { 
            goto Label_002A; 
        } 
        builder.Append(s, startIndex, length - startIndex); 
    } 
    return builder.ToString(); 
} 


Figure 3: Code for EncodeHtml from version 2.0 of 
System.Net. This code is not equivalent to the AntiXSS 
library version. 


2.2 Security Applications 


Web sanitizers are the first line of defense against cross-site 
scripting attacks for web applications: they are functions 
applied to untrusted data provided by a user that 
attempt to make the data “safe” for rendering in a web 
browser. Reasoning about the security properties of web 
sanitizers is crucial to the security of web applications 
and browsers. Formal verification of sanitizers is there- 
fore crucial in proving the absence of injection attacks 
such as cross-site and cross-channel scripting as well as 
information leaks. 


2.2.1 Security of Sanitizer Composition 


Recent work has demonstrated that developers may 
accidentally compose sanitizers in ways that are not 
safe [32]. BEK can check two key properties of sanitizer 
composition: commutativity and idempotence. 


Commutativity: Consider two default sanitizers in 
the Google CTemplate framework: JavaScriptEscape 
and HTMLEscape [4]. The former performs Unicode 
encoding (\u00XX) for safely embedding untrusted 
data in JavaScript strings while the latter sanitizer performs 
HTML entity-encoding (&lt;) for embedded untrusted 
data in HTML content. It turns out that if 
JavaScriptEscape is applied to untrusted data before 
the application of HTMLEscape, certain XSS attacks are 
not prevented [32]. The opposite ordering does prevent 
these attacks. BEK can check if a pair of sanitizers are 
commutative, which would mean the programmer does 
not need to worry about this class of bugs. 
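The notion of commutativity can be illustrated on concrete inputs with a short Python sketch. The two escapers below are simplified stand-ins and not the CTemplate implementations: `html_escape` wraps the standard library's `html.escape`, and `js_escape` is a hypothetical function that \uXXXX-encodes every character that is not an ASCII letter or digit. Testing sample inputs can only refute commutativity; it cannot establish it, which is what the symbolic analysis in BEK is for.

```python
import html


def html_escape(t: str) -> str:
    """HTML entity-encode &, <, >, and both quote characters."""
    return html.escape(t, quote=True)


def js_escape(t: str) -> str:
    """Hypothetical JavaScript-string escaper: \\uXXXX for non-alphanumerics."""
    return "".join(
        c if (c.isascii() and c.isalnum()) else "\\u%04x" % ord(c)
        for c in t
    )


def non_commuting_inputs(f, g, samples):
    """Return the samples on which f(g(s)) and g(f(s)) differ."""
    return [s for s in samples if f(g(s)) != g(f(s))]


witnesses = non_commuting_inputs(html_escape, js_escape, ["abc", "<"])
```

Here `witnesses` contains `"<"`: escaping for HTML first yields `&lt;`, whose `&` and `;` are then \u-encoded, while escaping for JavaScript first yields `\u003c`, which the HTML escaper leaves untouched.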


Idempotence: BEK can check if applying the sanitizer 
twice yields different behavior from a single application. 
For example, an extra JavaScript string encoding may 
break the intended rendering behavior in the browser. 


2.2.2 Sanitizer Implementation Correctness 


Hand-coded sanitizers are notoriously difficult to write 
correctly. Analyses provided by BEK help achieve cor- 
rectness in three ways. 


Comparing multiple sanitizer implementations: Mul- 
tiple implementations of the same sanitization function- 
ality can differ in subtle ways [9]. BEK can check 
whether two different programs written in the BEK lan- 
guage are equivalent. If they are not, BEK exhibits inputs 
that yield different behaviors. 
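As a rough illustration of what a distinguishing input looks like, the sketch below searches for one by brute force over all strings up to a small length. This is not the BEK equivalence algorithm, which works symbolically on transducers and handles strings of unbounded length; the helper names and the two toy encoders are assumptions chosen to mirror the single-quote difference discussed in Section 2.1.

```python
from itertools import product


def make_encoder(table):
    """Build a character-wise encoder from a {char: replacement} table."""
    return lambda t: "".join(table.get(c, c) for c in t)


def find_difference(f, g, alphabet, max_len):
    """Return the first string (shortest first) on which f and g disagree."""
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if f(s) != g(s):
                return s
    return None  # no difference found within the bound


# Two hypothetical encoders that differ only in their handling of '
encode_a = make_encoder({"&": "&amp;", "<": "&lt;", ">": "&gt;",
                         '"': "&quot;", "'": "&#39;"})
encode_b = make_encoder({"&": "&amp;", "<": "&lt;", ">": "&gt;",
                         '"': "&quot;"})

counterexample = find_difference(encode_a, encode_b, "a<'", 2)
```

The search returns the one-character string `'`, the shortest input on which the two encoders disagree. The cost grows exponentially with `max_len` and the alphabet size, which is one reason a symbolic method is attractive.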


Comparing sanitizers to browser filters: Internet Ex- 
plorer 8 and 9, Google Chrome, Safari, and Firefox em- 
ploy built-in XSS filters (or have extensions [3]) that ob- 
serve HTTP requests and responses [1,2] for attacks. 
These filters are most commonly specified as regular 
expressions, which we can model with BEK. We can 
then check for inputs that are disallowed by browser fil- 
ters, but which are allowed by sanitizers. For example, 
BEK can determine that the AntiXSS implementation of 
the EncodeHTML sanitizer in Figure 2 does not block 
strings such as javascript&#58; which are prevented by 
IE 8 XSS filters. These differences indicate potential 
bugs in the sanitizer or the filter. 


Bool Variables    b, ... 
Char Variables    c 
String Variables  t 

Bool Constants    B ∈ {t, f} 
Char Constants    d ∈ Σ 

Strings       sexpr   ::= iter(c in sexpr) {init} {case*} 
                        | fromLast(ccond, sexpr) 
                        | uptoLast(ccond, sexpr) | t 
              init    ::= (b := B)* 
              case    ::= case(bexpr) {cstmt} | endcase 
              endcase ::= end(ebexpr) {yield(d)*} 
              cstmt   ::= (b := ebexpr; | yield(cexpr);)* 
Booleans      bexpr   ::= Boolcomb(bexpr) | B | b | ccond 
              ebexpr  ::= Boolcomb(ebexpr) | B | b 
              ccond   ::= Boolcomb(ccond) | cexpr = cexpr 
                        | cexpr < cexpr | cexpr > cexpr 
Char strings  cexpr   ::= c | d | built-in-fnc(c) | cexpr + cexpr 


Figure 4: Concrete syntax for BEK. Well-formed BEK 
expressions are functions of type string → string; 
the language provides basic constructs to filter and transform 
the single input string t. Boolcomb(e) stands for 
Boolean combination of e using conjunction, disjunction, 
and negation. 


Checking against public attack sets: Several public 
XSS attack sets are available, such as the XSS Cheat 
Sheet [5]. With BEK, for all sanitizers, for all attack vectors 
in an attack set, we can check if there exists an input 
to the sanitizer that yields the attack vector. 


3 The BEK Language and Transducers 


In this section, we give a high-level description of a 
small imperative language, BEK, of low-level string op- 
erations. Our goal is two-fold. First, it should be possible 
to model BEK expressions in a way that allows for their 
analysis using existing constraint solvers. Second, we 
want BEK to be sufficiently expressive to closely model 
real-world code (such as Example 2). In this section 
we first present the BEK language. We then define the 
semantics of BEK programs in terms of symbolic finite 
transducers (SFTs), an extension of classical finite state 
transducers. Finally, we describe several core decision 
procedures for SFTs that provide an algorithmic foundation 
for efficient static analysis 
and verification of BEK programs. 


3.1 The BEK Language 


Figure 4 describes the language syntax. We define a single 
string variable, t, to represent an input string, and 
a number of expressions that can take either t or another 
expression as their input. The uptoLast(φ, t) and 
fromLast(φ, t) are built-in search operations that extract 
the prefix (suffix) of t up to (from) and excluding 
the last occurrence of a character satisfying φ. These 
constructs are listed separately because they cannot be 
implemented using other language features. Finally, the 
iter construct allows for character-by-character iteration 
over a string expression. 


Example 3. uptoLast(c = ‘.’, "www.abc.org") 
= "www.abc", fromLast(c = ‘.’, "www.abc.org") 
= "org". ⊠ 
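The two search operations can be sketched in Python as follows. The function names and the predicate-as-callable interface are assumptions. The text does not say what happens when no character satisfies the condition, so returning the input unchanged in that case is a guess made only to keep the sketch total.

```python
from typing import Callable


def _last_index(pred: Callable[[str], bool], t: str) -> int:
    """Index of the last character of t satisfying pred, or -1 if none."""
    for i in range(len(t) - 1, -1, -1):
        if pred(t[i]):
            return i
    return -1


def upto_last(pred: Callable[[str], bool], t: str) -> str:
    """Prefix of t up to, and excluding, the last character satisfying pred."""
    i = _last_index(pred, t)
    return t[:i] if i >= 0 else t  # no-match behavior is an assumption


def from_last(pred: Callable[[str], bool], t: str) -> str:
    """Suffix of t after, and excluding, the last character satisfying pred."""
    i = _last_index(pred, t)
    return t[i + 1:] if i >= 0 else t  # no-match behavior is an assumption


def is_dot(c: str) -> bool:
    return c == "."


prefix = upto_last(is_dot, "www.abc.org")
suffix = from_last(is_dot, "www.abc.org")
```

With the input from Example 3, `prefix` is `www.abc` and `suffix` is `org`.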


The iter construct is designed to model loops that tra- 
verse strings while making imperative updates to boolean 
variables. Given a string expression (sexpr), a character 
variable c, and an initial boolean state (init), the 
statement iterates over characters in sexpr and evaluates 
the conditions of the case statements in order. When a 
condition evaluates to true, the statements in cstmt may 
yield zero or more characters to the output and update the 
boolean variables for future iterations. The endcase applies 
when the end of the input string has been reached. 
When no case applies, this corresponds to yielding zero 
characters, and the iteration continues or the loop terminates 
if the end of the input has been reached. 
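The evaluation order described above (cases tried in order, the first match fires, no match yields nothing, end cases run once at the end of the input) can be sketched as a tiny Python interpreter. The representation is an assumption made for illustration: each case is a pair of Python callables, and the boolean state is a dictionary. The cases of Example 1 are used as the sample program.

```python
def run_iter(t, init, cases, end_cases=()):
    """Interpret an iter block over the string t.

    cases:     list of (cond(c, state), body(c, state)); body updates the
               state in place and returns the list of yielded characters.
    end_cases: list of (cond(state), chars) consulted once at end of input.
    """
    state = dict(init)
    out = []
    for c in t:
        for cond, body in cases:
            if cond(c, state):
                out.extend(body(c, state))
                break  # only the first matching case fires
        # if no case matched, zero characters are yielded for c
    for cond, chars in end_cases:
        if cond(state):
            out.extend(chars)
            break
    return "".join(out)


def quote_case(c, s):
    s["b"] = False
    return ["\\", c]


def slash_case(c, s):
    s["b"] = not s["b"]
    return [c]


def default_case(c, s):
    s["b"] = False
    return [c]


EXAMPLE_1 = [
    (lambda c, s: not s["b"] and c in "'\"", quote_case),
    (lambda c, s: c == "\\", slash_case),
    (lambda c, s: True, default_case),
]

result = run_iter('\\\\"', {"b": False}, EXAMPLE_1)
```

Running the interpreter on the input `\\"` gives `\\\"`, as in Example 1, and a program with no cases maps every input to the empty string.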


3.2 Finite Transducers 


We start with the classical definition of finite state transducers. 
The particular subclass of finite transducers that 
we are considering here are also called generalized sequential 
machines or GSMs [29]; however, this definition 
is not standardized in the literature, and we therefore 
continue to say finite transducers for this restricted 
case. The restriction is that GSMs read one symbol at 
each transition, while a more general definition allows 
transitions that skip inputs. 


Definition 1. A Finite Transducer $A$ is defined as a six-tuple 
$(Q, q^0, F, \Sigma, \Gamma, \Delta)$, where $Q$ is a finite set of states, 
$q^0 \in Q$ is the initial state, $F \subseteq Q$ is the set of final states, 
$\Sigma$ is the input alphabet, $\Gamma$ is the output alphabet, and $\Delta$ 
is the transition function from $Q \times \Sigma$ to $2^{Q \times \Gamma^*}$. 


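We indicate a component of a finite transducer $A$ by 
using $A$ as a subscript. For $(q, v) \in \Delta_A(p, a)$ we define 
the notation $p \xrightarrow{a/v}_A q$, where $p, q \in Q_A$, $a \in \Sigma_A$ and 
$v \in \Gamma_A^*$. We write $p \xrightarrow{a/v} q$ when $A$ is clear from the 
context. Given words $v$ and $w$ we let $v \cdot w$ denote the 
concatenation of $v$ and $w$. Note that $v \cdot \epsilon = \epsilon \cdot v = v$. 

Given $q_i \xrightarrow{a_i/v_i}_A q_{i+1}$ for $i < n$ we write $q_0 \xrightarrow{u/v}_A q_n$ 
where $u = a_0 \cdot a_1 \cdot \ldots \cdot a_{n-1}$ and $v = v_0 \cdot v_1 \cdot \ldots \cdot v_{n-1}$. We 
write also $q \xrightarrow{\epsilon/\epsilon}_A q$. $A$ induces the finite transduction, 
$T_A : \Sigma_A^* \to 2^{\Gamma_A^*}$: 


$$T_A(u) \stackrel{\mathrm{def}}{=} \{ v \mid \exists q \in F_A \, (q_A^0 \xrightarrow{u/v} q) \}$$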
We lift the definition to sets, $T_A(U) = \bigcup_{u \in U} T_A(u)$. 
Given two finite transductions $T_1$ and $T_2$, $T_1 \circ T_2$ denotes 
the finite transduction that maps an input word $u$ to 
the set $T_2(T_1(u))$. In the following let $A$ and $B$ be finite 
transducers. A fundamental composition of $A$ and $B$ is 
the join composition of $A$ and $B$. 
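The transduction induced by a finite transducer can be computed directly from the definition by tracking all reachable pairs of state and output. The Python sketch below assumes a dictionary encoding of the transition function, mapping a (state, symbol) pair to a set of (next state, output word) pairs; this encoding and the sample transducer are illustrative assumptions.

```python
def transduce(delta, q0, finals, u):
    """Return T_A(u): the outputs of all runs on u that end in a final state."""
    configs = {(q0, "")}  # reachable (state, output so far) pairs
    for a in u:
        configs = {
            (q2, out + v)
            for (q1, out) in configs
            for (q2, v) in delta.get((q1, a), ())
        }
    return {out for (q, out) in configs if q in finals}


def transduce_set(delta, q0, finals, words):
    """Lift to sets: T_A(U) is the union of T_A(u) over u in U."""
    result = set()
    for u in words:
        result |= transduce(delta, q0, finals, u)
    return result


# Hypothetical one-state transducer: each 'a' is either output as 'x' or dropped.
DELTA = {("q", "a"): {("q", "x"), ("q", "")}}

outputs = transduce(DELTA, "q", {"q"}, "aa")
```

The example transducer is not single-valued: on the input `aa` it produces the three outputs `""`, `"x"`, and `"xx"`, while an input containing a symbol with no transition produces the empty set.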


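Definition 2. The join of $A$ and $B$ is the finite transducer 


$$A \circ B = (Q_A \times Q_B, (q_A^0, q_B^0), F_A \times F_B, \Sigma_A, \Gamma_B, \Delta_{A \circ B})$$

where, for all $(p, q) \in Q_A \times Q_B$ and $a \in \Sigma_A$: 


$$
\Delta_{A \circ B}((p, q), a) \stackrel{\mathrm{def}}{=}
\{ ((p', q), \epsilon) \mid p \xrightarrow{a/\epsilon}_A p' \}
\cup \{ ((p', q'), v) \mid (\exists u \in \Gamma_A^+) \; p \xrightarrow{a/u}_A p', \; q \xrightarrow{u/v}_B q' \}
$$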





The following property is well-known and allows us 
to drop the distinction between $A$ and $T_A$ 
without causing ambiguity. 


Proposition 1. $T_{A \circ B} = T_A \circ T_B$. 
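The join construction and Proposition 1 can be checked on a small example with the following Python sketch. It reuses a dictionary encoding of transition functions, mapping (state, symbol) to a set of (next state, output word) pairs; that encoding, the function names, and the two sample transducers are assumptions made for illustration, and checking one input is of course not a proof.

```python
def run_from(delta, q, u):
    """All (state, output) pairs reachable from q by reading the whole word u."""
    configs = {(q, "")}
    for a in u:
        configs = {
            (q2, out + v)
            for (q1, out) in configs
            for (q2, v) in delta.get((q1, a), ())
        }
    return configs


def transduce(delta, q0, finals, u):
    """T_A(u): outputs of runs on u that end in a final state."""
    return {out for (q, out) in run_from(delta, q0, u) if q in finals}


def join(states_a, sigma_a, delta_a, states_b, delta_b):
    """Transition function of the join of A and B, following Definition 2."""
    delta = {}
    for p in states_a:
        for q in states_b:
            for a in sigma_a:
                moves = set()
                for (p2, u) in delta_a.get((p, a), ()):
                    if u == "":
                        moves.add(((p2, q), ""))  # A outputs nothing, B stays put
                    else:
                        for (q2, v) in run_from(delta_b, q, u):
                            moves.add(((p2, q2), v))  # feed A's output u to B
                if moves:
                    delta[((p, q), a)] = moves
    return delta


# Hypothetical transducers: A maps 'a' to "xy" and drops 'b';
# B maps 'x' to "1" and 'y' to either "2" or "22".
DELTA_A = {("p", "a"): {("p", "xy")}, ("p", "b"): {("p", "")}}
DELTA_B = {("q", "x"): {("q", "1")}, ("q", "y"): {("q", "2"), ("q", "22")}}

DELTA_AB = join({"p"}, "ab", DELTA_A, {"q"}, DELTA_B)
direct = transduce(DELTA_AB, ("p", "q"), {("p", "q")}, "aba")
staged = set()
for w in transduce(DELTA_A, "p", {"p"}, "aba"):
    staged |= transduce(DELTA_B, "q", {"q"}, w)
```

On the input `aba`, running the join directly (`direct`) and running $A$ followed by $B$ (`staged`) give the same set of four outputs, as Proposition 1 predicts.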


The following classification of finite transducers plays a 
central role in the sections discussing translation from 
BEK and decision procedures for 
symbolic finite transducers. 


Definition 3. $A$ is single-valued if for all $u \in \Sigma_A^*$, 
$|A(u)| \le 1$. 


3.3 Symbolic Finite Transducers 


Symbolic finite transducers, as defined below, provide a 
symbolic representation of finite transducers using terms 
modulo a given background theory $T$. The background 
universe $\mathcal{V}$ of values is assumed to be multi-sorted, where 
each sort $\sigma$ corresponds to a sub-universe $\mathcal{V}^\sigma$. The 
boolean sort is BOOL and contains the truth values t 
(true) and f (false). The definition of terms and formulas 
(boolean terms) is the standard inductive definition, using 
the function symbols and predicate symbols of $T$, logical 
connectives, as well as uninterpreted constants with 
given sorts. All terms are assumed to be well-sorted. A 
term $t$ of sort $\sigma$ is indicated by $t : \sigma$. Given a term $t$ and a 
substitution $\theta$ from variables (or uninterpreted constants) 
to terms or values, $\mathit{Subst}(t, \theta)$ denotes the term resulting 
from applying the substitution $\theta$ to $t$. 

A model is a mapping of uninterpreted constants to 
values.³ A model for a term $t$ is a model that provides 
an interpretation for all uninterpreted constants that occur 
in $t$. (All free variables are treated as uninterpreted 
constants.) The interpretation or value of a term $t$ in a 
model $M$ for $t$ is given by standard Tarski semantics using 
induction over the structure of terms, and is denoted 
by $t^M$. A formula (predicate) $\varphi$ is true in a model $M$ 
for $\varphi$, denoted by $M \models \varphi$, if $\varphi^M$ evaluates to true. A 
formula $\varphi$ is satisfiable, denoted by $\mathit{IsSat}(\varphi)$, if there 
exists a model $M$ such that $M \models \varphi$. Any term $t : \sigma$ that 
includes no uninterpreted constants is called a value term 
and denotes a concrete value $[\![t]\!] \in \mathcal{V}^\sigma$. 


³ The interpretation of the background functions of $T$ is fixed and is 
assumed to be an implicit part of all models. 

Let $\mathrm{Term}_T^\gamma(\bar{x})$ denote the set of all terms in $T$ of sort
$\gamma$, where $\bar{x} = x_0, \ldots, x_{n-1}$ may occur as the only
uninterpreted constants (variables). Let $\mathrm{Pred}_T(\bar{x})$ denote
$\mathrm{Term}_T^{\mathrm{BOOL}}(\bar{x})$. In order to avoid ambiguities in notation,
given a set $E$ of elements, we write $[e_0, \ldots, e_{n-1}]$ for
elements of $E^*$, i.e., sequences of elements from $E$. We
use both $[]$ and $\epsilon$ to denote the empty sequence. As above,
if $e_1, e_2 \in E^*$, then $e_1 \cdot e_2 \in E^*$ denotes the
concatenation of $e_1$ with $e_2$. We lift the interpretation of
terms to apply to sequences: for $u = [u_0, \ldots, u_{n-1}] \in
\mathrm{Term}_T^\gamma(\bar{x})^*$ let $u^M = [u_0^M, \ldots, u_{n-1}^M] \in (\mathcal{V}^\gamma)^*$.

In the following let $c : \sigma$ be a fixed uninterpreted constant
of sort $\sigma$. We refer to $c : \sigma$ as the input variable (for
the given sort $\sigma$).








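Definition 4. A Symbolic Finite Transducer (SFT) for $T$
is a six-tuple $(Q, q^0, F, \sigma, \gamma, \delta)$, where $Q$ is a finite set of
states, $q^0 \in Q$ is the initial state, $F \subseteq Q$ is the set of
final states, $\sigma$ is the input sort, $\gamma$ is the output sort, and
$\delta$ is the symbolic transition function from $Q \times \mathrm{Pred}_T(c)$
to $2^{Q \times \mathrm{Term}_T^\gamma(c)^*}$.

We use the notation $p \xrightarrow{\varphi/u}_A q$ for $(q, u) \in \delta_A(p, \varphi)$
and call $p \xrightarrow{\varphi/u}_A q$ a symbolic transition; $\varphi/u$ is called
its label, $\varphi$ is called its input (guard) and $u$ its output.
An SFT $A = (Q, q^0, F, \sigma, \gamma, \delta)$ denotes the finite
transducer $[\![A]\!] = (Q, q^0, F, \mathcal{V}^\sigma, \mathcal{V}^\gamma, \Delta)$ where
$p \xrightarrow{a/v}_{[\![A]\!]} q$ if and only if there exists
$p \xrightarrow{\varphi/u}_A q$ and a model $M$
such that $M \models \varphi$, $c^M = a$, $u^M = v$.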

For an SFT $A$ let the underlying transduction $T_A$ be
$T_{[\![A]\!]}$. For a state $q \in Q_A$ let $T_A^q(v)$ ($T_{[\![A]\!]}^q(v)$) denote
the set of outputs when starting from $q$ with input $v$. In
particular, if $q = q_A^0$ then $T_A^q = T_A$ and $T_{[\![A]\!]}^q = T_{[\![A]\!]}$.
The following proposition follows directly from the
definition of $[\![A]\!]$.

Proposition 2. For $v \in (\mathcal{V}^\sigma)^*$ and $q \in Q_A$:
$T_A^q(v) = T_{[\![A]\!]}^q(v)$.

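Example 4. The identity SFT Id (for sort $\sigma$) is defined
as follows. $Id = (\{q\}, q, \{q\}, \sigma, \sigma, \{q \xrightarrow{t/[c]} q\})$. Thus, for
all $a \in \mathcal{V}^\sigma$, $q \xrightarrow{a/[a]}_{[\![Id]\!]} q$, and $[\![Id]\!](v) = \{v\}$ for all
$v \in (\mathcal{V}^\sigma)^*$. $\boxtimes$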
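
As a minimal sketch of Definition 4 and Example 4 (not the BEK implementation), an SFT can be represented with guards as Python predicates over the input character and output terms as functions of it. The names `Transition`, `SFT`, and `run` are hypothetical, and a real SFT keeps guards as solver terms rather than opaque functions.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Set, Tuple

Guard = Callable[[str], bool]   # stands in for a predicate in Pred_T(c)
Term = Callable[[str], str]     # stands in for an output term over c


@dataclass(frozen=True)
class Transition:
    src: str
    guard: Guard
    outputs: Tuple[Term, ...]
    dst: str


@dataclass
class SFT:
    initial: str
    finals: FrozenSet[str]
    transitions: List[Transition]

    def run(self, word: str) -> Set[str]:
        """Return the set of outputs for `word`, following every enabled transition."""
        configs = {(self.initial, "")}
        for ch in word:
            nxt = set()
            for state, out in configs:
                for t in self.transitions:
                    if t.src == state and t.guard(ch):
                        nxt.add((t.dst, out + "".join(u(ch) for u in t.outputs)))
            configs = nxt
        return {out for state, out in configs if state in self.finals}


# The identity SFT of Example 4: one state, guard "true", output [c].
identity = SFT("q", frozenset({"q"}),
               [Transition("q", lambda c: True, (lambda c: c,), "q")])
```

Calling `identity.run(v)` on any string `v` yields the singleton set containing `v`, mirroring $[\![Id]\!](v) = \{v\}$.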


Example 5. Assume $\sigma$ is the sort for characters. The
predicate c = '.' says that the input character is a dot.
The SFT UptoLastDot such that for all strings $v$,

    UptoLastDot(v) = uptoLast(c = '.', v),

where uptoLast is the BEK function introduced above,
is shown in Figure 5. $\boxtimes$

[Transition diagram with edge guards ($c \neq$ '.'), (c = '.'), and (t).]

Figure 5: Symbolic finite state transducer for
uptoLast(c = '.', input). This transducer is
nondeterministic; there are two transitions that match '.'
from state $q_0$.
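
The nondeterminism noted for Figure 5 can be illustrated with a small simulation. The transition structure below is an assumption reconstructed from the guards alone (copy the input, guess the last dot, then discard a dot-free tail); in particular, dropping the dot itself and rejecting inputs without a dot are choices made for this sketch, not behavior confirmed by the text.

```python
def upto_last_dot(word: str) -> set:
    # Assumed transitions: (source, guard, output term, destination).
    transitions = [
        ("q0", lambda c: True,     lambda c: c,  "q0"),  # copy the character
        ("q0", lambda c: c == ".", lambda c: "", "q1"),  # guess: this is the last dot
        ("q1", lambda c: c != ".", lambda c: "", "q1"),  # discard the dot-free tail
    ]
    finals = {"q1"}
    configs = {("q0", "")}
    for ch in word:
        # Follow every enabled transition from every live configuration.
        configs = {(dst, out + term(ch))
                   for state, out in configs
                   for src, guard, term, dst in transitions
                   if src == state and guard(ch)}
    return {out for state, out in configs if state in finals}
```

Although two transitions match a dot in `q0`, a wrong guess dies when a later dot arrives in `q1`, so at most one run survives and the transducer stays single-valued.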


Composition works directly with SFTs. It keeps the
resulting SFT clean in the sense that all symbolic transitions
are feasible, and it eliminates states that are unreachable
from the initial state as well as non-initial states
that are not backwards reachable from any final state. In
order to preserve feasibility of transitions the algorithm
uses a solver for checking satisfiability of formulas in
$\mathrm{Pred}_T(c)$.


3.4 BEK to SFT translation 


The basic sort needed in this section, besides BOOL, is
a sort CHAR for characters. We also assume the background
relation $<\, : \mathrm{CHAR} \times \mathrm{CHAR} \to \mathrm{BOOL}$ as a strict
total order corresponding to the standard lexicographic
order over ASCII (or Unicode) characters and assume $>$,
$\le$ and $\ge$ to be defined accordingly. We also assume that
each individual character has a built-in constant such as
'a' : CHAR. For example,

    ('A' $\le c \wedge c \le$ 'Z') $\vee$ ('a' $\le c \wedge c \le$ 'z') $\vee$
    ('0' $\le c \wedge c \le$ '9') $\vee$ $c =$ '_'

describes the regex character class \w of all word characters
in ASCII. (Direct use of regex character classes
in BEK, such as case(\w) {...}, is supported by the
enhanced syntax of the BEK analyzer tool.)
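
The formula for \w can be checked mechanically. The following sketch writes the four disjuncts as a Python predicate and compares it, over the 128 ASCII code points, with Python's `re` module in ASCII mode; the function names are hypothetical and Python's regex engine is only a stand-in for the regex dialect BEK targets.

```python
import re


def is_word_char(c: str) -> bool:
    # Direct transcription of the four disjuncts of the formula.
    return ("A" <= c <= "Z") or ("a" <= c <= "z") or ("0" <= c <= "9") or c == "_"


def agrees_with_regex() -> bool:
    # re.ASCII restricts \w to [a-zA-Z0-9_].
    return all(
        is_word_char(chr(i)) == bool(re.fullmatch(r"\w", chr(i), re.ASCII))
        for i in range(128)
    )
```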
Each sexpr $e$ is translated into an SFT $\mathrm{SFT}(e)$.
For the string variable $t$, $\mathrm{SFT}(t) = Id$, with Id
as in Example 4. The translation of uptoLast($\varphi$, $e$)
is the symbolic composition $\mathrm{SFT}(e) \circ B$ where $B$
is an SFT similar to the one in Example 5, except
that the condition c = '.' is replaced by $\varphi$. The
translation of fromLast($\varphi$, $e$) is analogous. Finally,
SFT(iter(c in e) {init} {case*}) = $\mathrm{SFT}(e) \circ B$
where $B = (Q, q^0, Q, \mathrm{CHAR}, \mathrm{CHAR}, \delta)$ is
constructed as follows:


Step 1: Normalize. Transform case* so that case conditions
are mutually exclusive by adding the negations
of previous case conditions as conjuncts to all
the subsequent case conditions, and ensure that each
boolean variable has exactly one assignment in each
cstmt (add the trivial assignment b := b
if b is not assigned).

Step 2: Compute states. Compute the set of states $Q$.
Let $q^0$ be an initial state as the truth assignment to
boolean variables declared in init.⁴ Compute the
set $Q$ of all reachable states, by using DFS, such
that, given a reached state $q$, if there exists a case
case($\varphi$) {cstmt} such that $\mathrm{Subst}(\varphi, q)$ is
satisfiable then add the state

    $\{ b \mapsto [\![\mathrm{Subst}(\psi, q)]\!] \mid b := \psi \in \mathit{cstmt} \}$    (1)

to $Q$. (Note that $\mathrm{Subst}(\psi, q)$ is a value term.)

Step 3: Compute transitions. Compute the symbolic
transition function $\delta$. For each state $q \in Q$ and
for each case case($\varphi$) {cstmt} such that $\phi =
\mathrm{Subst}(\varphi, q)$ is satisfiable, let $p$ be the state
computed in (1). Let yield($u_0$), ..., yield($u_{n-1}$) be
the sequence of yields in cstmt and let $u =
[u_0, \ldots, u_{n-1}]$. Add the symbolic
transition $q \xrightarrow{\phi/u} p$ to $\delta$.
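
Steps 1 to 3 can be sketched as follows. This is an illustration under stated assumptions, not the BEK translator: cases are encoded as hypothetical Python triples, a brute-force scan of ASCII stands in for the solver, and the sample program (escape quotes unless preceded by a backslash) is only loosely modeled on the behavior described for Figure 6.

```python
ALPHABET = [chr(i) for i in range(128)]  # assumption: brute force replaces the solver

# Hypothetical case encoding: (condition(c, env), {var: rhs(env)}, [yield_term(c), ...])


def translate(init, cases):
    def freeze(env):
        return tuple(sorted(env.items()))

    def exclusive_guard(i, env):
        # Step 1: case i fires only if none of the earlier cases does.
        return lambda c: cases[i][0](c, env) and not any(
            cases[j][0](c, env) for j in range(i))

    q0 = freeze(init)
    states, stack, delta = {q0}, [q0], []
    while stack:  # Step 2: depth-first search over truth assignments
        q = stack.pop()
        env = dict(q)
        for i, (_, assigns, yields) in enumerate(cases):
            guard = exclusive_guard(i, env)
            if not any(guard(c) for c in ALPHABET):
                continue  # guard unsatisfiable in state q
            target = dict(env)  # unassigned variables keep their value (b := b)
            target.update({b: rhs(env) for b, rhs in assigns.items()})
            p = freeze(target)
            if p not in states:
                states.add(p)
                stack.append(p)
            delta.append((q, guard, yields, p))  # Step 3: add the transition
    return q0, states, delta


def run(sft, word):
    q0, _, delta = sft
    q, out = q0, []
    for ch in word:
        for src, guard, yields, dst in delta:
            if src == q and guard(ch):
                out.extend(y(ch) for y in yields)
                q = dst
                break
        else:
            return None  # no transition applies
    return "".join(out)


# Hypothetical program: b is true right after an unescaped backslash.
CASES = [
    (lambda c, env: not env["b"] and c in "'\"",
     {"b": lambda env: False}, [lambda c: "\\", lambda c: c]),
    (lambda c, env: c == "\\",
     {"b": lambda env: not env["b"]}, [lambda c: c]),
    (lambda c, env: True,
     {"b": lambda env: False}, [lambda c: c]),
]
ESCAPER = translate({"b": False}, CASES)
```

For this sample program the search finds two states (b false and b true) and five feasible transitions; the quote-escaping case is dropped in the state where b is true because its guard is unsatisfiable there.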


The translation of end-cases is similar, resulting in symbolic
transitions with guard $c = \bot$, where $\bot$ is a special
character used to indicate end-of-string. We assume
$\bot$ to be least with respect to $<$. For example, assuming
that the BEK programs use concrete ASCII characters,
$\bot$ : CHAR is either an additional character, or the null
character '\0' if only null-terminated strings are considered
as valid input strings. Although practically important,
end-cases do not cause algorithmic complications,
and for the sake of clarity we avoid them in further discussion.

⁴ Note that $q^0$ is the empty assignment if init is empty, which
trivializes this step.

Figure 6: SFT for BEK program in Example 1. This
SFT escapes single and double quotes with a backslash,
except if the current symbol is already escaped. The
application of this SFT is idempotent.

The algorithm uses a solver to check satisfiability of
guard formulas. If checking satisfiability of a formula
times out, for example, then it is safe to assume satisfiability
and to include the corresponding symbolic transition.
This will potentially add infeasible guards but retains the
correctness of the resulting SFT, meaning that the underlying
finite transduction is unchanged. While in most
cases checking satisfiability of guards seems straightforward,
this perception is deceptive when considering Unicode.
As an example, the regex character class
[\W-[\D]] denotes an empty set since \d is a subset of
\w and \W (\D) is the complement of \w (\d), and thus
[\W-[\D]] is the intersection of \W and \d. Just the character
class \w alone contains 323 non-overlapping ranges in
Unicode, totaling 47,057 characters. A naive algorithm
for checking satisfiability (non-emptiness) of [\W-[\D]]
may easily time out.
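
The emptiness argument for [\W-[\D]] can be replayed on sorted range lists, which avoids enumerating individual characters. This is a sketch restricted to ASCII (an assumption made to keep the ranges short); it is not the solver encoding used by BEK, and the helper names are hypothetical.

```python
MAX_CP = 0x7F  # assumption: ASCII universe instead of full Unicode

# A character class is a sorted list of disjoint inclusive (lo, hi) ranges.


def complement(rs):
    out, lo = [], 0
    for a, b in rs:
        if lo < a:
            out.append((lo, a - 1))
        lo = b + 1
    if lo <= MAX_CP:
        out.append((lo, MAX_CP))
    return out


def intersect(xs, ys):
    out, i, j = [], 0, 0
    while i < len(xs) and j < len(ys):
        lo, hi = max(xs[i][0], ys[j][0]), min(xs[i][1], ys[j][1])
        if lo <= hi:
            out.append((lo, hi))
        # Advance whichever range ends first.
        if xs[i][1] < ys[j][1]:
            i += 1
        else:
            j += 1
    return out


def subtract(xs, ys):
    # X - Y is X intersected with the complement of Y.
    return intersect(xs, complement(ys))


DIGIT = [(0x30, 0x39)]                                          # \d
WORD = [(0x30, 0x39), (0x41, 0x5A), (0x5F, 0x5F), (0x61, 0x7A)]  # \w
NOT_WORD, NOT_DIGIT = complement(WORD), complement(DIGIT)        # \W, \D

EMPTY_CLASS = subtract(NOT_WORD, NOT_DIGIT)   # [\W-[\D]]
DIGITS_AGAIN = subtract(WORD, NOT_DIGIT)      # [\w-[\D]]
```

Each operation is linear in the number of ranges, so the cost depends on the 323 ranges rather than the 47,057 characters in the Unicode case.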


Consider the BEK program in Example 1. The
corresponding SFT constructed by the above translation is
shown in Figure 6. There are two symbolic transitions
from state $q_0$ to itself. The first corresponds to the cases
where the input character $c$ needs to be escaped, and the
second to cases where the input does not need to be escaped.


3.5 Join Composition and Equivalence 


We now give an informal description of our core algo- 
rithms for reasoning about SFTs: join composition and 
equivalence. We then show how these algorithms can be 
used to check properties such as idempotence, existence 
of an input yielding a target string, and commutativity. 


The join composition $A \circ B$ corresponds to a program
transformation that constructs a single loop over the input
string out of two consecutive loops in SFTs $A$ and $B$.
The join composition algorithm constructs an SFT $A \circ B$
such that $T_{[\![A \circ B]\!]} = T_{[\![A]\!]} \circ T_{[\![B]\!]}$. The intuition behind the
construction is that the outputs produced by $A$ are
substituted symbolically as the inputs consumed by
$B$. The composition algorithm proceeds by depth-first
search, first computing $Q_{A \circ B}$ as a reachable
subset of $Q_A \times Q_B$, starting from $(q_A^0, q_B^0)$. Here
we use the SMT solver to determine reachability, calling
the solver as a black box to determine if a path from one
state to another is feasible or not. This makes our
construction independent of the particular background theory.
In general, this is not true for other recent extensions
of finite transducers such as streaming transducers [6],
where compositionality depends on properties of
the background theory that is being used.

Two SFTs $A$ and $B$ are equivalent if $T_A = T_B$. Let
$\mathrm{Dom}(A) = \{ v \mid T_A(v) \neq \emptyset \}$.

Checking equivalence of $A$ and $B$ reduces to two separate
tasks:

1. Deciding domain-equivalence: $\mathrm{Dom}(A) = \mathrm{Dom}(B)$.

2. Deciding partial-equivalence: for all $v \in
   \mathrm{Dom}(A) \cap \mathrm{Dom}(B)$, $T_A(v) = T_B(v)$.


Note that 1 and 2 are independent and do not imply
each other, but together they imply equivalence. Domain
equivalence holds for all SFTs constructed by BEK,
because all programs share the same domain, namely
that of strings. Checking partial equivalence is more involved.
We leverage the fact that all SFTs we construct
are single-valued. Our equivalence algorithm first computes
the join composition of $A$ and $B$, then uses the
SMT solver to search for inputs that cause $A$ to differ
from $B$. We have a nonconstructive proof of termination
for this algorithm: it establishes that if $A$ and $B$
are equivalent, then the search must terminate in time
quadratic in the number of states of the composed automata.
In practice, the SMT solver carries out this
search, and our results in Section 4 show scaling is closer
to linear.

Equivalence and join composition allow us to carry out
a variety of other analyses. Idempotence of an SFT $A$
can be checked by first computing $B = A \circ A$, then
checking the equivalence of $A$ and $B$. If the two SFTs are
not equivalent, then $A$ fails to be idempotent. Similarly,
commutativity of two SFTs $A$ and $B$ can be determined
by computing $C = A \circ B$ and $D = B \circ A$, then checking
equivalence. The idea is illustrated in Figure 7. We can
also compute the inverse image of an SFT with respect to a
string $s$, which lets us find the set of inputs to the SFT
that yield $s$ as an output. We use all of these analyses to
check sanitizers for security properties in the next section.
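
The following sketch illustrates join composition and its use for an idempotence check. Several simplifications are assumptions of this example rather than features of BEK: guards and output terms are Python functions of the input character, feasibility is decided by scanning printable ASCII instead of calling an SMT solver, unproductive states are not pruned, and equivalence is replaced by a bounded search over short strings, which can find a counterexample but cannot prove equivalence.

```python
from itertools import product

ALPHABET = [chr(i) for i in range(32, 127)]  # assumption: printable ASCII

# A transducer is (initial, finals, transitions); a transition is
# (src, guard(c), [term(c), ...], dst).


def b_paths(B, q, terms):
    """Ways for B, starting in q, to consume A's symbolic outputs `terms`."""
    if not terms:
        yield (lambda c: True), [], q
        return
    u, rest = terms[0], terms[1:]
    for src, psi, vs, dst in B[2]:
        if src != q:
            continue
        for g, outs, end in b_paths(B, dst, rest):
            # Substitute A's output term u(c) for B's input character.
            guard = lambda c, psi=psi, u=u, g=g: psi(u(c)) and g(c)
            new_outs = [(lambda c, v=v, u=u: v(u(c))) for v in vs] + outs
            yield guard, new_outs, end


def compose(A, B):
    """Apply A first, then B, in a single pass over the input."""
    start = (A[0], B[0])
    states, stack, trans = {start}, [start], []
    while stack:  # depth-first search over reachable state pairs
        p, q = stack.pop()
        for src, phi, us, dst in A[2]:
            if src != p:
                continue
            for g, outs, q2 in b_paths(B, q, us):
                guard = lambda c, phi=phi, g=g: phi(c) and g(c)
                if not any(guard(c) for c in ALPHABET):
                    continue  # infeasible combination, drop it
                tgt = (dst, q2)
                trans.append(((p, q), guard, outs, tgt))
                if tgt not in states:
                    states.add(tgt)
                    stack.append(tgt)
    finals = {s for s in states if s[0] in A[1] and s[1] in B[1]}
    return start, finals, trans


def run(T, word):
    configs = {(T[0], "")}
    for ch in word:
        configs = {(dst, out + "".join(t(ch) for t in terms))
                   for state, out in configs
                   for src, guard, terms, dst in T[2]
                   if src == state and guard(ch)}
    return {out for state, out in configs if state in T[1]}


def first_difference(A, B, alphabet="a&;", max_len=3):
    """Bounded search for an input on which A and B disagree."""
    for n in range(max_len + 1):
        for w in map("".join, product(alphabet, repeat=n)):
            if run(A, w) != run(B, w):
                return w
    return None


# Hypothetical one-state sanitizers.
ID = ("q", {"q"}, [("q", lambda c: True, [lambda c: c], "q")])
ESC = ("q", {"q"}, [
    ("q", lambda c: c == "&", [lambda c, s=s: s for s in "&amp;"], "q"),
    ("q", lambda c: c != "&", [lambda c: c], "q"),
])
ESC_TWICE = compose(ESC, ESC)
```

Here `first_difference(ESC, ESC_TWICE)` returns a witness showing that the ampersand escaper is not idempotent, while the identity transducer and its self-composition agree on every string tried.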








[Diagram panels: "A not idempotent"; "A and B not commutative".]

Figure 7: Using composition and equivalence of SFTs
to decide idempotence and commutativity.

Our approach has an advantage over traditional finite
transducers (FTs), due to the succinctness of SFTs. Suppose
for example that the background character theory $T$ is
$k$-bit bit-vector arithmetic where $k$ depends on the desired
character range (e.g., for Unicode, $k = 16$). An explicit
expansion of a BEK SFT $A$ to $[\![A]\!]$ may increase the size
(number of transitions) by a factor of $2^k$. Partial-equivalence
of single-valued FTs is solvable in $O(n^2)$ time [15]. Thus,
for an SFT $A$ of size $n$, using the partial-equivalence
algorithm for $[\![A]\!]$ takes $O((2^k n)^2)$ time. In contrast, the
partial-equivalence algorithm for BEK SFTs is $O(n^2)$.
When the background theory is linear arithmetic,
the alphabet is infinite and a corresponding FT algorithm
is therefore not even possible.
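
To make the gap concrete, take the Unicode case $k = 16$ mentioned above. The following is only the arithmetic implied by the stated bounds, not a measured result:

\[
O\big((2^{k} n)^2\big) = O\big(2^{2k} n^2\big), \qquad
k = 16 \;\Rightarrow\; 2^{2k} = 2^{32} = 4{,}294{,}967{,}296 \approx 4.3 \times 10^{9}.
\]

The bound for the explicit expansion therefore exceeds the $O(n^2)$ bound for BEK SFTs by a factor of roughly four billion.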


4 Evaluation 


In the following subsections, we evaluate the real-world
applicability of BEK in terms of expressiveness,
utility, and performance:


• Section 4.1 evaluates whether BEK can model existing
  real-world code. We conduct an empirical
  study of a large body of code to see how widely
  used BEK-modelable sanitizer functions are
  (Section 4.1.1), and we evaluate which BEK features
  are needed to model sanitizers from AutoEscape,
  OWASP, and Internet Explorer 8 (Section 4.1.2).

• We put BEK to work to check existing sanitizers for
  idempotence, commutativity, and reversibility
  (Section 4.2).

• We perform pair-wise equivalence checks on a number
  of ported HTMLEncode implementations, as well
  as two outsourced implementations (Section 4.3).

• We evaluate effectiveness of existing HTMLEncode
  implementations against known attack strings taken
  from the Cross-site Scripting Cheat Sheet
  (Section 4.4).

• We use a synthetic benchmark to evaluate the
  scalability of performing equivalence checks on BEK
  programs (Section 4.5).

• We provide a short example to highlight the fact
  that BEK programs can be readily translated to other
  programming languages (Section 4.6).


These experiments are based on an implementation that
consists of roughly 5,000 lines of C# code that implements
the basic transducer algorithms and Z3 [14] integration,
with another 1,000 lines of F# code for translation
from BEK to transducers. Our experiments were carried
out on a Lenovo ThinkPad W500 laptop with 8 GB
of RAM and an Intel Core 2 Duo P9600 processor running
at 2.67 GHz, running 64-bit Windows 7.


4.1 Expressive Utility 


Thus far, we have discussed the expressiveness of BEK
primarily in theoretical terms. In this subsection, we turn
our attention to real-world applicability instead, through
a case study that aims to demonstrate that a wide variety
of commonly used sanitizers can be ported to BEK with
relative ease.


4.1.1 Frequency of Sanitizer use in PHP code. 


PHP is a widely-used open source server-side scripting 
language. Minamide’s seminal work on the static anal- 
ysis of dynamic web applications [26] includes finite- 
transducer based models for a subset of PHP’s sanitizer 
functions. These transducers are hand-crafted in several 
thousand lines of OCaml. We conducted an informal re- 
view of the PHP source to confirm that each transducer 
could be modeled as a BEK program. 

Our goal is to perform a high-level quantitative com- 
parison of the applicability of BEK, on the one hand, 
and existing string constraint solvers (e.g., DPRLE [17], 
Hampi [20], Kaluza [30], and Rex [35]) on the other. For 
this comparison, we assume that each Minamide trans- 
ducer could instead be modeled as a BEK program. We 
then use statistics from a study by Hooimeijer [16] that 
measured the relative frequency, by static count, of 111 
distinct PHP string library functions. The Hooimeijer
study was conducted in December 2009, and covers the
top 100 projects on SourceForge.net, or about 9.6 million
lines of PHP code. The study considered most, but
not all, sanitizers provided by Minamide.

Out of the 111 distinct functions considered in the
Hooimeijer study, 27 were modeled as transducers by
Minamide and are thus encodable in BEK. In the sampled
PHP code, these 27 functions account for 68,238
out of 251,317 uses, or about 27% of all string-related
call sites. By comparison, traditional regular expression
functions modeled by tools like Hampi [20] and Rex [35]
account for just 29,141 call sites, or about 12%. We note
that BEK could be readily integrated into an automaton-based
tool like Rex, however, and our features are largely
complementary to those of traditional string constraint
solvers. These results suggest that BEK provides a significant
improvement in the “coverage” of real-world code
by string analysis tools.
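
The rounded percentages follow directly from the call-site counts quoted above:

\[
\frac{68{,}238}{251{,}317} \approx 0.272, \qquad
\frac{29{,}141}{251{,}317} \approx 0.116, \qquad
\frac{68{,}238}{29{,}141} \approx 2.34 .
\]

In this sample, the transducer-modeled functions thus cover a little more than twice as many call sites as the regular expression functions.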


4.1.2 Language Features 


For the remainder of the experiments, we use a small
dataset of ported-to-BEK sanitizers. We now discuss
that dataset and the manual conversion effort required.
The results are summarized in Figure 8, and described in
more detail below.


Google AutoEscape and OWASP. We converted san- 
itizers from the OWASP sanitizer library to BEK pro- 
grams. We also evaluated sanitizers from the Google 
AutoEscape framework to determine what language fea- 
tures they would need to be expressed in BEK. These 
sanitizers are marked with prefixes GA and OWASP, re- 
spectively, in Figure 8. We verified that each of these 
sanitizers can be implemented in BEK. In several cases, 
we find additional non-native features that could be 
added to BEK to support these sanitizers. 


Internet Explorer. In addition, we extracted sanitizers 
from the binary of Internet Explorer 8 that are used 
in the IE Cross-Site Scripting Filter feature, denoted 
IEFilter1 to IEFilter17 in Figure 8. For this study, 
we analyze the behavior of the IE 8 sanitizers under 
the assumption the server performs no sanitization of 
its own on user data. Of these 21 sanitizers, we could 
convert 17 directly into BEK programs. The remaining 4 
sanitizers track a potentially unbounded list of characters 
that are either emitted unaltered or escaped, depending 
on the result of a regular expression match. BEK does 
not enable storing strings of input characters. 


The manual translation took several hours per sani- 
tizer. Figure 8 breaks down our BEK programs based on 
“Native” features of the BEK language, and “Not Native” 
features which are not currently in the BEK language. 
Many of these features can be modeled using
transducers, however, by enhancing the language of constraints
used for symbolic labels. In addition, with the
exception of 4 Internet Explorer sanitizers, we found that 
a maximum lookahead window of eight characters would 
suffice for handling all our sanitizers. Finally, we discov- 
ered that the arithmetic on characters was limited to right 
shifts and linear arithmetic, which can be expressed in 
the Z3 solver we use. 

We note that all “Not Native” features could be added 
to the BEK language with few or no changes to the under- 
lying SFT algorithms for join composition and equiva- 
lence checking: only the front end would need to change. 


4.1.3 Browser Code


Ideally, we could use BEK to model the parser of an actual
web browser. Then, we could use our analyses to
check whether there exists a string that passes through a
given sanitizer yet causes JavaScript execution. We performed
a preliminary exploration of the WebKit browser
to determine how difficult it would be to write such
a model with BEK. Unfortunately, we found multiple
functions that require features, such as bounded lookahead
and transducer composition, which are not yet supported
by the BEK language.

For example, we considered a function in the Safari
implementation of WebKit that performs JavaScript decoding [7].
This function requires at a minimum the use
of functions to convert hexadecimal to ASCII, a lookahead
of 5 characters, function composition, and scanning
for occurrences of a target character. While, as
noted above, we believe these features could be added
to BEK without fundamentally changing the underlying
algorithms for symbolic transducers, the BEK language
does not yet support them.

Figure 8: Expressiveness: different language features
used by the original corpus of different programs. A
cross means that the feature was not used by the program
in its initial implementation. A checkmark means
the feature was used by the program. Boolean variables,
multiple iterations over a string, and regular expressions
are native constructs in BEK. Multiple lookahead, arithmetic,
and functions are not native to BEK and must be
emulated during the translation. We also show the distinct
boolean variables used by the BEK implementation.


4.2 Checking Algebraic Properties


We argued in Section 2 that idempotence and commuta- 
tivity are key properties for sanitizers. In addition, the 
property of reversibility, that from the output of a sani- 
tizer we can unambiguously recover the input, is impor- 
tant as an aid to debugging. 


4.2.1 Order Independence 


We now evaluate whether 17 sanitizers used in IE 8 are 
order independent. Order independence means that the 
sanitizers have the same effect no matter in what order 
they are applied. If the order does matter, then the choice 
of order can yield surprising results. As an example, in 
rule-based firewalls, a set of rules that are not order in- 
dependent may result in a rule never being applied, even 
though the administrator of the firewall believes the rule 
is in use. 

Each IE 8 sanitizer defines a specific input set on 
which it will transform strings, which we can compute 
from the BEK model. We began by checking all 136 pairs 
of IE 8 sanitizers to determine whether their input sets 
were disjoint. Only one pair of sanitizers showed a non- 
trivial intersection in their input sets. A non-trivial in- 
tersection signals a potential order dependence, because 
the two sanitizers will transform the same strings. For 
this pair, we used BEK to check that the two sanitizers 
output the same language, when restricted to inputs from 
their intersection. BEK determined that the transformation
of the two sanitizers on these inputs was exactly the
same; i.e., the two sanitizers were equivalent on the
intersection set. We conclude that the IE 8 sanitizers are 
in fact order independent, up to errors in our extraction 
of the sanitizers and our assumption that no server-side 
modification is present. 


4.2.2 Idempotence and Reversibility 


We now examine the idempotence of several BEK programs,
including the IE 8 sanitizers. Figure 9 reports
the results: the number of states in the symbolic finite
transducer created from each BEK program and, for each
transducer, whether it is idempotent and
whether it is reversible. The number of states
acts as a rough guide to the complexity of the sanitizer.
For example, we see that IE filter 9 out of 17 is quite
complicated, with 25 states.


4.2.3 Commutativity 


We investigated commutativity of seven different implementations
of HTMLEncode, a sanitizer commonly used
by web applications. Four implementations were gathered
from internal sources. Three were created for our
project specifically by hiring freelance programmers from
popular outsourcing web sites to create implementations.
We provided these programmers with a high
level specification in English that emphasized protection
against cross-site scripting attacks. Figure 10 shows a
commutativity matrix for the HTMLEncode implementations.
A ✓ indicates the pair of sanitizers commute,
while a ✗ indicates they do not. The matrix contains 12
check marks out of 42 total comparisons of distinct sanitizers,
or 28.6%. Our implementation took less than one
minute to complete all 42 comparisons.

Name                 States  Idempotent?  Reversible?
a2bb2a                    1       ✗            ✓
escapeBrackets            1       ✓            ✗
escapeMetaAndLink         1       ✓            ✓
escapeString0O            1       ✗            ✗
escapeString              1       ✗            ✗
escapeStringSimple        1       ✗            ✗
getFileExtension          2       ✗            ✗
IEFilter1                 6       ✓            ✗
IEFilter2                 9       ✓            ✗
IEFilter3                19       ✓            ✗
IEFilter4                13       ✓            ✗
IEFilter5                13       ✓            ✗
IEFilter6                16       ✓            ✗
IEFilter7                13       ✓            ✗
IEFilter8                12       ✓            ✗
IEFilter9                25       ✓            ✗
IEFilter10               18       ✓            ✗
IEFilter11               11       ✓            ✗
IEFilter12               11       ✓            ✗
IEFilter13               14       ✓            ✗
IEFilter14               14       ✓            ✗
IEFilter15                1       ✓            ✗
IEFilter16                1       ✓            ✗
IEFilter17                1       ✓            ✗

Figure 9: For each BEK benchmark program, we report
the number of states in the corresponding symbolic transducer.
We then report whether the transducer is idempotent,
and whether the transducer is reversible.

              HE1  HE2  HE3  HE4  Out1 Out2 Out3
HTMLEncode1    ✓    ✓    ✓    ✗    ✗    ✓    ✗
HTMLEncode2    ✓    ✓    ✓    ✗    ✗    ✓    ✗
HTMLEncode3    ✓    ✓    ✓    ✗    ✗    ✓    ✗
HTMLEncode4    ✗    ✗    ✗    ✓    ✗    ✗    ✗
Outsourced1    ✗    ✗    ✗    ✗    ✓    ✗    ✗
Outsourced2    ✓    ✓    ✓    ✗    ✗    ✓    ✗
Outsourced3    ✗    ✗    ✗    ✗    ✗    ✗    ✓

Figure 10: Commutativity matrix for seven different implementations
of HTMLEncode. The Outsourced implementations
were written by freelancers from a high level
English specification.
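
The counts stated for the commutativity matrix can be rechecked from the matrix itself. The sketch below assumes that columns follow the same order as rows (the extracted figure carries no column labels) and encodes a check mark as `v` and a cross as `x`.

```python
NAMES = ["HTMLEncode1", "HTMLEncode2", "HTMLEncode3", "HTMLEncode4",
         "Outsourced1", "Outsourced2", "Outsourced3"]

# One string per row of Figure 10; "v" = commute, "x" = do not commute.
ROWS = ["vvvxxvx",
        "vvvxxvx",
        "vvvxxvx",
        "xxxvxxx",
        "xxxxvxx",
        "vvvxxvx",
        "xxxxxxv"]

n = len(NAMES)
# Count ordered pairs of distinct sanitizers, as the text does.
commuting = sum(ROWS[i][j] == "v" for i in range(n) for j in range(n) if i != j)
total = n * (n - 1)
percentage = round(100 * commuting / total, 1)
symmetric = all(ROWS[i][j] == ROWS[j][i] for i in range(n) for j in range(n))
```

Under that assumption the matrix is symmetric, and the off-diagonal check marks reproduce the 12 out of 42 (28.6%) reported in the text.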


4.3 Differences Between Multiple Implementations 


Multiple implementations of the “same” functionality are
commonly available from which to choose when writing
a web application. For example, newer versions of a library
may update the behavior of a piece of code. Different
organizations may also write independent implementations
of the same functionality, guided by performance
improvements or by different requirements. Given these
different implementations, the first key question is “do
all these implementations compute the same function?”
Then, if there are differences, the second key question is
“how do these implementations differ?”

              HE1  HE2  HE3  HE4  Out1 Out2 Out3
HTMLEncode1    ✓    ✓    ✓    0    -    ✓    0
HTMLEncode2    ✓    ✓    ✓    0    -    ✓    0
HTMLEncode3    ✓    ✓    ✓    0    -    ✓    0
HTMLEncode4    0    0    0    ✓    0    0    0
Outsourced1    -    -    -    0    ✓    -    0
Outsourced2    ✓    ✓    ✓    0    -    ✓    0
Outsourced3    0    0    0    0    0    0    ✓

Figure 11: Equivalence matrix for our implementations
of HTMLEncode. A ✓ indicates the implementations are
equivalent. For implementations that are not equivalent,
we show an example character that exhibits different behavior
in the two implementations. The symbol 0 refers
to the null character.

As described above, because BEK programs correspond
to single-valued symbolic finite state transducers,
computing the image of regular languages under the
function defined by a BEK program is decidable. By taking
the image of $\Sigma^*$ under two different BEK programs,
we can determine whether they output the same set of strings.

We checked equivalence of seven different implemen- 
tations in C# (as explained above) of the HTMLEncode 
sanitization function. We translated all seven implemen- 
tations to BEK programs by hand. First, we discovered 
that all seven implementations had only one state when 
transformed to a symbolic finite transducer. We then 
found that all seven are neither reversible nor idempotent. 
For example, the ampersand character & is expanded to 
&amp; by all seven implementations. This in turn con- 
tains an ampersand that will be re-expanded on future 
applications of the sanitizer, violating idempotence. 
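
The ampersand argument can be reproduced with any encoder that rewrites & to &amp;. The sketch below uses Python's standard `html.escape` as a stand-in; it is not one of the seven implementations studied here.

```python
import html


def html_encode(s: str) -> str:
    # Stand-in encoder: escapes &, <, >, and both quote characters.
    return html.escape(s, quote=True)


once = html_encode("&")
twice = html_encode(once)
is_idempotent_on_ampersand = (once == twice)
```

The second application re-expands the ampersand introduced by the first, so `once` and `twice` differ and the encoder is not idempotent.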

For each BEK program, we checked whether it was 
equivalent to the other HTMLEncode implementations. 
Figure 11 shows the results. For cases where the 
two implementations are not equivalent, BEK derived 
a counterexample string that is treated differently by 
the two implementations. For example, we discovered
that Outsourced1 escapes the - character, while
Outsourced2 does not. We also found that one of the
HTMLEncode implementations does not encode the sin- 
gle quote character. Because the single quote charac- 
ter can close HTML contexts, failure to encode it could 
cause unexpected behavior for a web developer who uses 
this implementation. For example, a recent attack on the 
Google Analytics dashboard was enabled by failure to 
sanitize a single quote [33]. 

This case study shows the benefit of automatic analysis
of string manipulating functions to check equivalence.
Without BEK, obtaining this information using manual
inspection would be difficult, error prone, and time consuming.
With BEK, we spent roughly 3 days total translating
from C# to BEK programs. Then BEK was able
to compute the contents of Figure 11 in less than one
minute, including all equivalence and containment checks.

Implementation   HTML context   Attribute context
HTMLEncode1          100%            93.5%
HTMLEncode2          100%            93.5%
HTMLEncode3          100%            93.5%
HTMLEncode4          100%            100%
Outsourced1          100%            93.5%
Outsourced2          100%            93.5%
Outsourced3          100%            93.5%

Figure 12: Percentage of XSS Cheat Sheet strings, in
both HTML tag context and tag attribute contexts, that
are ruled out by each implementation of HTMLEncode.


4.4 Checking Filters Against The Cheat Sheet 


The Cross-Site Scripting Cheat Sheet (“XSS Cheat 
Sheet”) is a regularly updated set of strings that trigger 
JavaScript execution on commonly used web browsers. 
These strings are specially crafted to cause popular web 
browsers to execute JavaScript, while evading common 
sanitization functions. Once we have translated a sani- 
tizer to a program in BEK, because BEK uses symbolic 
finite state transducers, we can take a “target” string and 
determine whether there exists a string that when fed to 
the sanitizer results in the target. In other words, we 
can check whether a string on the Cheat Sheet has a pre- 
image under the function defined by a BEK program. 
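
For one concrete encoder, the pre-image question can be answered without transducers, which makes the idea easy to see. The sketch below again uses Python's `html.escape` as a stand-in sanitizer (an assumption, not one of the implementations evaluated here): it decodes the five entities that encoder emits and then checks whether re-encoding reproduces the target. BEK instead answers this question on the SFT, for arbitrary BEK programs.

```python
import html
import re

# The entities produced by html.escape(..., quote=True).
_ENTITIES = {"&amp;": "&", "&lt;": "<", "&gt;": ">", "&quot;": '"', "&#x27;": "'"}
_ENTITY_RE = re.compile("|".join(re.escape(e) for e in _ENTITIES))


def encode(s: str) -> str:
    return html.escape(s, quote=True)


def preimage(target: str):
    """Return an input that encodes to `target`, or None if none exists."""
    candidate = _ENTITY_RE.sub(lambda m: _ENTITIES[m.group(0)], target)
    return candidate if encode(candidate) == target else None
```

A target containing a raw angle bracket has no pre-image, whereas a string such as `javascript:alert(1)` passes through unchanged and so is its own pre-image, which is the kind of case the Attribute set probes.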
We sampled 28 strings from the Cheat Sheet. The 
Cheat Sheet shows snippets of HTML, but in practice a 
sanitizer might be run only on a substring of the snip- 
pet. We focused on the case where a sanitizer is run 
on the HTML Attribute field, extracting sub-strings from 
the Cheat Sheet examples that correspond to the attribute 
parsing context. While HTMLEncode should not be used 
for sanitizing data that will become part of a URL at- 
tribute, in practice programmers may accidentally use 
HTMLEncode in this “incorrect” context. We also added 
some strings specifically to check the handling of HTML 
attribute parsing by our sanitizers. As a result, we ob- 
tained two sets of attack strings: HTML and Attribute. 
For each of our implementations, for all strings in
each set, we then asked BEK whether pre-images of that
string exist. Figure 12 shows what percentage of strings
have no pre-image under each implementation. All seven
implementations correctly escape angle brackets, so no
string in the HTML set has a pre-image under any of the
sanitizers. In the case of the Attribute strings, however,
we found that some of the implementations do not escape
the string “&#”, potentially yielding an attack. Only one
of our implementations of HTMLEncode prevented every
string in the Attribute set from appearing in its output.
For each sanitizer, BEK took between 36 and 39 seconds
to check an entire set of strings.

Figure 13: Self-equivalence experiment.


4.5 Scalability of Equivalence Checking 


Our theoretical analysis suggests that the speed of
queries to BEK should scale quadratically in the number
of states of the symbolic finite transducer. All sanitizers
we have found in “the wild,” however, have a small
number of states. While this makes answering queries
about the sanitizers fast, it does not shed light on the
empirical performance of BEK as the number of states
increases. To address this, we performed two experiments
with synthetically generated symbolic finite transducers.
These transducers were specially created to exhibit some
of the structure observed in real sanitizers, yet have many
more states than observed in practical sanitizer implementations.


Self-equivalence experiment. We generated symbolic 
finite transducers A from randomly generated BEK pro- 
grams having structure similar to typical sanitizers. The 
time to check equivalence of A with itself is shown in 
Figure 13 where the size is the number of states plus 
the number of transitions in A. Although the worst case 
complexity is quadratic, the actual observed complexity, 
for a sample size of 1,000, is linear. 


Commutativity experiment. We generated symbolic
finite transducers from randomly generated BEK programs
having structure similar to typical sanitizers. For
each symbolic finite transducer A, we checked commutativity
with a small BEK program UpToLastDot that returns
a string up to the last dot character. The time to
determine that A ∘ UpToLastDot and UpToLastDot ∘ A
are equivalent is shown in Figure 14, where the size is the
total number of states plus the number of transitions in
A. The time to check non-equivalence was in most cases
only a few milliseconds, so the reported data exclude the
cases where the result is not equivalent and include only
the cases where the result is equivalent. Although the
worst-case complexity is quadratic, the actual observed
complexity, over a sample size of 1,000 individual cases,
was near-linear.

Figure 14: Commutativity experiment.
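The commutativity question can be illustrated on concrete inputs. The sketch below is a test-based approximation, not BEK's symbolic check. The function upToLastDot follows the description in the text, with the added assumptions that the dot itself is dropped and that an input without a dot is returned unchanged. The sanitizers dropDollar and bangToDot and the sample inputs are hypothetical stand-ins chosen for illustration.

```javascript
// Hypothetical reading of UpToLastDot: keep everything before the last '.',
// and return the input unchanged when it contains no dot (an assumption).
function upToLastDot(s) {
  const i = s.lastIndexOf('.');
  return i === -1 ? s : s.slice(0, i);
}

// Hypothetical sanitizer that never touches dots: it deletes '$'.
function dropDollar(s) {
  return s.split('$').join('');
}

// Hypothetical sanitizer that introduces dots: it rewrites '!' as '.'.
function bangToDot(s) {
  return s.split('!').join('.');
}

// Returns the first sample on which f(g(x)) and g(f(x)) differ, or null.
function findCommuteCounterexample(f, g, samples) {
  for (const x of samples) {
    if (f(g(x)) !== g(f(x))) return x;
  }
  return null;
}

const samples = ['a.b.c', 'no dots', '$1.50', 'x.y!z', ''];
const cex1 = findCommuteCounterexample(dropDollar, upToLastDot, samples);
const cex2 = findCommuteCounterexample(bangToDot, upToLastDot, samples);
```

Sampling of this kind can only refute commutativity by producing a counterexample such as the one found for bangToDot; establishing that two compositions agree on all inputs requires an equivalence check like the one BEK performs.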


4.6 From BEK to Other Languages 


We have built compilers from BEK programs to com- 
monly used languages. When the time comes for deploy- 
ment, the developer can compile to the language of her 
choice for inclusion into an application. 

Figure 15 shows a small example of a BEK program 
and the result of its JavaScript compilation. As part of 
the compilation, we have taken advantage of our knowl- 
edge of properties of JavaScript to improve the speed of 
the compiled code. For example, we push characters into 
arrays instead of creating new string objects. The result 
is standard JavaScript code that can be easily included in 
any web application. By adding additional compilers for 
common languages, such as C#, we can give a developer 
multiple implementations of a sanitizer that are guaran- 
teed to be equivalent for use in different contexts. 


5 Related Work 


SANER combines dynamic and static analysis to validate 
sanitization functions in web applications [9]. SANER 
creates finite state transducers for an over-approximation 
of the strings accepted by the sanitizer using static anal- 
ysis of existing PHP code. In contrast, our work focuses 
on a simple language that is expressive enough to capture
existing sanitizers or to write new ones by hand, and that
compiles to symbolic finite state transducers that precisely
capture the sanitization function. SANER also treats the 
issue of inputs that may be tainted by an adversary, which 
is not in scope for our work. Our work also focuses on ef- 
ficient ways to compose sanitizers and combine the the- 
ory of finite state transducers with SMT solvers, which 
is not treated by SANER. 


Minamide constructs a string analyzer for PHP code, 
then uses this string analyzer to obtain context free gram- 
mars that are over-approximations of the HTML output 
by a server [26]. He shows how these grammars can 
be used to find pages with invalid HTML. The method 
proposed in [21] can also be applied to string analysis 
by modeling regular string analysis problems as higher- 
order multi-parameter tree transducers (HMTTs) where 
strings are represented as linear trees. While HMTTs allow
encodings of finite transducers, arbitrary background
character theories are not directly expressible in order to
encode SFTs. Our work treats issues of composition and
state explosion for finite state transducers by leveraging
recent progress in SMT solvers, which aids us in reasoning
precisely about the transducers created by transformation
of BEK programs and by avoiding state space explosion
and bitblasting for large character domains such
as Unicode. Moreover, SMT solvers provide a method
of extracting concrete counterexamples.


// original Bek program
program test0(t);
string s;
s := iter(c in t)
  {b := false;} {
    case ((c == 'a')):
      b := !(b) && b;
      b := b || b;
      b := !(b);
      yield (c);
    case (true):
      yield ('$');
  };

//
// JavaScript translation
//
function test0(t) {
  var s =
    function ($){
      var result = new Array();
      for (i=0;i<$.length; i++){
        var c = $[i];
        if ((c == String.fromCharCode(97))) {
          b = (!(b) && b);
          b = (b || b);
          b = !(b);
          result.push(c);
        }
        if (t) {
          result.push(String.fromCharCode(36));
        }
      }
      return result.join('');
    }
  return s(t);
}

Figure 15: A small example BEK program (top) and its
compiled version in JavaScript (bottom). Note the use of
result.push instead of explicit array assignment.

Wasserman and Su also perform static analysis of 
PHP code to construct a grammar capturing an over- 
approximation of string values. Their application is to 
SQL injection attacks, while our framework allows us to 
ask questions about any sanitizer [36]. Follow-on work 
combines this work with dynamic test input generation to 
find attacks on full PHP web applications [37]. Dynamic 
analysis of PHP code, using a combination of symbolic 
and concrete execution techniques, is implemented in the 
Apollo tool [8]. The work in [39] describes a layered 
static analysis algorithm for detecting security vulnerabilities
in PHP code that is also able to handle some
dynamic features. In contrast, our focus is specifically 
on sanitizers instead of on full applications; we empha- 
size analysis precision over scaling to large code bases. 

Christensen et al.’s Java String Analyzer is a static
analysis package for deriving finite automata that charac- 
terize an over-approximation of possible values for string 
variables in Java [13]. The focus of their work is on an- 
alyzing legacy Java code and on speed of analysis. In 
contrast, we focus on precision of the analysis and on 
constructing a specific language to capture sanitizers, as 
well as on the integration with SMT solvers. 

Our work is complementary to previous efforts in ex- 
tending SMT solvers to understand the theory of strings. 
HAMPI [20] and Kaluza [31] extend the STP solver to 
handle equations over strings and equations with mul- 
tiple variables. Rex extends the Z3 solver to handle 
regular expression constraints [35], while Hooimeijer et al.
show how to solve subset constraints on regular languages [17].
We, in contrast, show how to combine any of these solvers
with finite transducers whose edges can take symbolic
values in any of the theories supported by the solver.

The work in [28] introduces the first symbolic extension
of finite state transducers, called a predicate-augmented
finite state transducer (pfst). A pfst has two
kinds of transitions: 1) $p \xrightarrow{\varphi/\psi} q$, where
$\varphi$ and $\psi$ are character predicates or $\epsilon$, or
2) $p \xrightarrow{c/c} q$. In the first case,
the symbolic transition corresponds to all concrete transitions
$p \xrightarrow{a/b} q$ such that $\varphi(a)$ and $\psi(b)$ are true;
the second case corresponds to identity transitions
$p \xrightarrow{a/a} q$ for all characters $a$. A pfst is not
expressive enough for describing an SFT. Besides identities,
it is not possible to establish functional dependencies
from input to output that are needed, for example, to
encode sanitizers such as EncodeHtml.
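To make the expressiveness gap concrete, the sketch below models one transition in each style. The object shapes, field names, and the numeric-entity rewriting rule are hypothetical illustrations; they are not the pfst definition from [28] or BEK's SFT representation.

```javascript
// pfst-style transition (hypothetical encoding): input and output are
// constrained by two independent predicates, so the output character
// cannot be computed from the input character.
const pfstTransition = {
  inputPred: (a) => a === '<' || a === '>',
  outputPred: (b) => /[0-9]/.test(b),
};

// True when the pair (a, b) is one of the concrete transitions allowed.
function pfstAllows(t, a, b) {
  return t.inputPred(a) && t.outputPred(b);
}

// SFT-style transition (hypothetical encoding): a guard on the input plus
// an output that is a function of the input character.
const sftTransition = {
  guard: (c) => !/[A-Za-z0-9]/.test(c),
  output: (c) => '&#' + c.codePointAt(0) + ';',
};

// Runs a single-state transducer with one rewriting transition;
// characters that fail the guard are copied through unchanged.
function runSingleState(t, input) {
  let out = '';
  for (const c of input) {
    out += t.guard(c) ? t.output(c) : c;
  }
  return out;
}

const encoded = runSingleState(sftTransition, 'a<b');
```

In the pfst-style transition, any digit is an acceptable output for '<', so the encoding of a character as its own numeric code cannot be expressed; the SFT-style transition ties the output to the input directly.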

A recent symbolic extension of finite transducers is 
streaming transducers [6]. While the theoretical expres- 
siveness of the language introduced in [6] exceeds that 
of BEK, streaming transducers are restricted to charac- 
ter theories that are total orders with no other operations. 
Also, composition of streaming transducers requires an 
explicit treatment of characters. It is an interesting future 
research topic to investigate if there is an extension of 
SFTs or a restriction of streaming transducers that allows 
efficient symbolic analysis techniques to be applied. 


6 Conclusions 


Much prior work in XSS prevention assumes the correct- 
ness of sanitization functions. However, practical expe- 
rience shows writing correct sanitizers is far from triv- 
ial. This paper presents BEK, a language and a compiler 
for writing and analyzing string manipulation routines and
converting them to general-purpose languages. Our lan- 
guage is expressive enough to capture real web sanitizers 
used in ASP.NET, the Internet Explorer XSS Filter, and 
the Google AutoEscape framework, which we demon- 
strate by porting these sanitizers to BEK. 

We have shown how the analyses supported by our 
tool can find security-critical bugs or check that such 
bugs do not exist. To improve the end-user experience 
when a bug is found, BEK produces a counter-example. 
We discover that only 28.6% of our sanitizers commute, 
~79.1% are idempotent, and only 8% are reversible. We
also demonstrate that most hand-written HTMLEncode 
implementations disagree on at least some inputs. Un- 
like previously published techniques, BEK deals equally 
well with Unicode strings without creating a state ex- 
plosion. Furthermore, we show that our algorithms for 
equivalence checking and composition computation are 
extremely fast in practice, scaling near-linearly with the 
size of the symbolic finite transducer representation.
Abstract


We address the challenge of building secure embedded 
web interfaces by proposing WebDroid: the first frame- 
work specifically dedicated to this purpose. Our design 
extends the Android Framework, and enables developers
to easily create secure web interfaces for their applications.
To motivate our work, we perform an in-depth study
of the security of web interfaces embedded in consumer 
electronics devices, uncover significant vulnerabilities in 
all the devices examined, and categorize the vulnerabili- 
ties. We demonstrate how our framework’s security mech- 
anisms prevent embedded applications from suffering the 
vulnerabilities exposed by our audit. Finally we evaluate 
the efficiency of our framework in terms of performance 
and security. 


1 Introduction 


Virtually all network-capable devices, including sim- 
ple consumer electronics such as printers and photo 
frames, ship with an embedded web interface for easy 
configuration. The ubiquity of web interfaces can be 
explained by two key factors. For end users, they are easy 
to use because the interaction takes place in a familiar 
environment: the web browser. For device manufacturers, 
providing a web-based interface is cheaper than develop- 
ing and maintaining custom software and installers. 


Though web interfaces are clearly an effective solution 
from a usability perspective, considerable expertise is 
required to make them secure [50]. Our first security 
audit of embedded web interfaces ([7]) provided the 
initial impetus for our work. To underscore the impact of 
these earlier results, we point out that compromising a 
networked device can be used as a stepping stone towards 
compromising the local network [45]. For example, 
compromising a photo frame in an office building 
can lead to an infection of a Web browser connecting
to the photo frame. The infection can subsequently
spread to the entire local network, and also result in
privacy breaches [8]. For instance, a router web interface
can be exploited to remotely steal the WiFi WPA key
and gain access to the entire network. Mitigating the
threats posed by embedded devices, including routers,
is becoming a critical task, as pointed out repeatedly in
recent work [7, 45, 19, 27]. In the absence of a reference
framework for building embedded web interfaces,
each vendor is forced to develop its own stack, which
usually leads to security problems. This work takes the 
initial studies a step further and proposes a solution 
that uniformly addresses all of the known sources of 
vulnerabilities in embedded web applications. 


We have chosen to build our reference implementation 
as an Android application for several reasons. First, 
Android has quickly become the premier open embedded 
operating system on the market, shipping not only on 
tens of millions of smart-phones every year, but also on 
specialized devices such as the Nook e-book reader by 
Barnes & Noble. Second, Android’s de facto bias towards
the ARM architecture makes the operating system 
suitable for embedding in other consumer devices such 
as cameras, photo frames, and media hubs. Third, the 
security architecture adopted by Android is particularly 
well-suited for embedded single-user devices as it casts 
the system security question into one of effectively 
isolating concurrent, possibly vulnerable applications. 


Our main contribution in this paper, WebDroid [16], is 
the first open-source web framework specifically designed 
for building secure embedded web interfaces: 


• WebDroid is designed, implemented and evaluated
based on the knowledge we gained by auditing the web
interfaces of more than 30 embedded devices over
the last two years, and the more than 50 vulnerabilities
we discovered on these devices.

• WebDroid is a novel composition of security design
principles and techniques with a simple and intuitive
configuration interface where most of the security
mechanisms are enabled by default, including location
and network address restrictions, as well as
server-side CSP and frame-busting.

• WebDroid also features application-wide authentication
that ensures that every embedded web
application will have a secure login and logout
mechanism which is resistant to attacks, including
brute-forcing and session hijacking.


Similar to previous work done on building secure web 
servers (e.g., the OKWS server [29]), our framework 
separates the core web server components from the 
applications to protect against low level attacks. Unlike 
previous systems however, our framework also mitigates 
all of the known application-level attacks including XSS 
(Cross-Site Scripting) [13], CSRF (Cross Site Request 
Forgery) [50], SQL injection [50] and Clickjacking [44]. 


The remainder of the paper is organized as follows: in 
Section 2 we briefly go through the background necessary 
to understand this work. In Section 3 we present and 
categorize the vulnerabilities we found during our audit 
work. Section 4 develops the threat model that we address 
with our system design depicted in Section 5. In Sec- 
tion 6 we highlight the main defense mechanisms that are 
employed in our implementation. Section 7 presents the 
user interface for managing web applications. Section 9 
discusses two application case studies and describes how 
WebDroid security mechanisms help to mitigate vulnera- 
bilities. In Section 10 we provide a summary of relevant 
related work, and Section 11 concludes the paper.


2 Background 


The embedded device market is growing rapidly. For 
example, in the 4th quarter of 2008, 7 million digital 
photo frames were sold, almost 50% more than in the 4th 
quarter of 2007. Similarly, analysts forecast that by 2012, 
12 million Network Attached Storage (NAS) devices 
will be sold each year. At the current pace, devices with 
embedded web servers will outnumber traditional web 
servers in less than 2 years; Netcraft reported that there 
are roughly 40 million active web servers on the Internet
in June 2009 [35]. 


In order to differentiate their products from those of 
their competitors, vendors are constantly adding novel 
features to their products, such as BitTorrent support in 
NAS devices.
As the number of features increases, a need for a
powerful management interface on the device rapidly
arises. To offer this in an intuitive, convenient, and
cost-effective way, vendors have started to embed web
interfaces in their products. While the best-known
use of these web interfaces is to configure network
equipment such as WiFi access points and routers,
many other embedded devices include web interfaces.
For instance, digital photo frames are an excellent
example of this expansion of features and need for a rich 
configuration interface. Thus, it is safe to say that web 
interfaces have become the norm in managing embedded 
devices. 


Our audit uncovered abundant examples of features 
that were hastily implemented and vulnerable to web attacks.
For example, the Flickr integration in digital photo
frames led to XSS attacks. What is especially troublesome
is the fact that we found CSRF exploits in managed
network switches intended for datacenter use. Attacks on
such devices could allow remote users to reboot them and 
effectively DoS an entire company intranet in one step. 


[Screenshot: Samsung Photo Frame Web Configuration page]





Figure 1: The web interface embedded into a Samsung 
photo frame. 


Figure 1 is a screenshot of the interface embedded in
a high-end Samsung photo frame. This interface allows
the user to control the frame’s display remotely, add an
Internet photo feed to be displayed on the frame, and
view various statistics. Although at first sight this
interface looks perfectly designed, we found that in
reality it is completely flawed: for example, it is possible
to bypass the authentication process to view photos, and
it is possible to inject an exploit via a CSRF and XSS
vulnerability that allows an attacker to extract photos
and send them to a remote server.


3 Embedded Web Application Security:
State of the Art 


Over the last two years we audited the web interfaces 
for more than 30 embedded devices. In this section we 
report our audit results and discuss the insights we gained 
from them. These results and insights are later used to 
justify and guide the design of our framework security 
features. Note that although we discussed some of the 
vulnerabilities we found in a previous publication [8], this 
is the first time that the complete audit results are reported 
and discussed. 


3.1 Audit coverage 


The eight categories of devices we tested are: lights-out 
management (LOM) interfaces (these typically allow the 
administrator to power cycle a PC or control network ac- 
cess, bypassing the OS), NAS (used for shared storage 
accessible via Ethernet), photo frames (we focused on 
“smart” frames with network connectivity), routers/access 
points (probably the most familiar browser-managed class 
of consumer device), IP cameras (with video feeds that
can be accessed over the network), IP phones (especially
those with a web-based management interface), switches
(“managed switches” that expose some configuration options),
and printers (the larger ones usually have an HTTP-based
interface used to configure a variety of functions,
including access via e-mail). The eight device categories
spanned seventeen brands: Table 1 shows which types
of devices were tested for each brand. As one can see,
we tested devices from vendors specialized in one type
of product, such as Buffalo, and from vendors that have a
wide range of products, such as D-Link.


3.2 Vulnerability classes 


XSS. As a warm-up we started by testing for Type 2 
(stored) cross-site scripting (XSS) vulnerabilities [13], 
which are common in web applications. Most devices 
are vulnerable, including those that perform some input 
checking. For example, the TrendNet switch ensures that 
its system location field does not contain spaces, but does 
not prevent attacks of the form: 


loc");document.write("<script/src=
'http://evil.com/a.js'></sc"+"ript>.
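The weakness of such a check can be sketched as follows. Both validators are hypothetical: hasNoSpaces mimics a no-spaces rule like the one described above (it is not the switch's actual code), and matchesWhitelist is one possible stricter alternative.

```javascript
// The payload from the text; it contains no whitespace at all.
const payload = `loc");document.write("<script/src='http://evil.com/a.js'></sc"+"ript>`;

// Hypothetical no-spaces rule: accepts any value without whitespace.
function hasNoSpaces(value) {
  return !/\s/.test(value);
}

// Hypothetical whitelist: accepts only letters, digits, '-' and '_'.
function matchesWhitelist(value) {
  return /^[A-Za-z0-9_-]+$/.test(value);
}

const passesSpaceCheck = hasNoSpaces(payload);
const passesWhitelist = matchesWhitelist(payload);
const benignPasses = matchesWhitelist('rack-12_b');
```

The payload passes the no-spaces rule because the rule describes one forbidden character rather than the set of permitted ones; a whitelist rejects it while still accepting an ordinary location name.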


XSS attacks are particularly dangerous on embedded
devices because they are the first step toward a persistent
reverse XCS, as discussed below.


CSRF. Cross-site request forgery [50] enables an attacker
to compromise a device by using an external web site as
a stepping stone for intranet infiltration. On embedded
devices it can also be used as a direct vector of attack, as it
allows the attacker to reboot critical network equipment
such as switches, IP phones and routers. Finally, we used
CSRF as a way to inject Type 2 (stored) XSS and reverse 
XCS [9] payloads. 


File security. For each device, we checked whether it was 
possible to read or inject arbitrary files. Some devices, 
such as the Samsung photo frame, allow the attacker 
to read protected files without being authenticated. On 
this device, even when the Web interface was protected 
by a password, it was still possible to access the photos 
stored in memory by using a specially crafted URL. On 
other devices, the Web interface could be compromised 
by abusing the log file. 


User authentication. Most devices have a default pass- 
word or no password at all. Additionally, most devices 
authenticate users in cleartext (i.e. without HTTPS). This 
was even true for several security cameras, which is sur- 
prising given that they are intended to securely monitor 
private spaces. We even found that some NAS and photo 
frames do not properly enforce the authentication mechanism,
and it is possible to access the user content (i.e., photos)
without being traced in the logs. Similarly, nothing is
done at the network level to prevent session hijacking, as
the traffic is in the clear and the cookies are sent over HTTP.
Finally, as far as we can tell, not a single device implements
a password policy or an anti-brute-force defense.


Clickjacking attacks. Clickjacking attacks [18] are the
most recent and most overlooked attack vector, as all
devices were vulnerable to them. While at first sight this
does not appear to be a big issue, it turns out that being
able to clickjack an embedded interface gives a lot of leverage
to the attacker. For example, basic Clickjacking can
be used to reboot devices, erase their content and, in the
case of routers, enable guest network access. Advanced
Clickjacking [49] as demonstrated by Paul Stone at Black- 
Hat Europe 2010 allows the attacker to steal the router 
WPA key or the NAS password. 


[Diagram labels: attacker, Alternate Channels, Web, Device]


Figure 2: Overview of an XCS attack.

Table 1: List of devices by brand.


XCS. A Cross-Channel Scripting attack [9] comprises 
two steps, as shown in Figure 2. In the first step the 
attacker uses a non-web communication channel such as 
FTP or SNMP to store malicious JavaScript code on the 
server. In the second step, the malicious content is sent 
to the victim via the Web interface. XCS vulnerabilities 
are prevalent in embedded devices since they typically 
expose multiple services beyond HTTP. XCS bugs often 
affect the interaction between two specific protocols only 
(such as the combination of HTTP and BitTorrent), which 
can make them harder to detect. 


Reverse XCS. In a Reverse XCS attack the web interface 
is used to attack another service on the device. We 
primarily use reverse XCS attacks to exfiltrate data that is 
protected by an access control mechanism. 


We did not look for SQL injections [21], as it was un- 
likely that the audited devices would contain a SQL server. 
However, we still consider SQL injection attacks to be a
potential threat, and therefore our framework has security
mechanisms in place to mitigate them. Finally, while in
some cases we found weaknesses in the networking stack
(for example, predictable Initial Sequence Numbers),
we do not discuss that topic here.


3.3 Tools used


The audit of each device was done in three phases. First, 
we performed a general assessment using NMap [31] and 
Nessus [42]. Next, we tested the web management inter- 
face using Firefox and several of its extensions: Firebug 
[20], Tamper Data [26], and Edit Cookies [51]. We used 
a custom tool for CSRF analysis. In the third phase we 
tested for XCS using hand-written scripts and command-line
tools such as smbclient.


3.4 Audit results 


Table 2 summarizes which classes of vulnerabilities 
were found for each type of device. We use one
symbol when one device is vulnerable to a class of
attacks and a different symbol when multiple devices in
the category are vulnerable. The second column from the
left indicates the number of devices tested in that category.
We survey the most interesting vulnerabilities in the next section.














Table 2 shows that the NAS category exhibits the
most vulnerabilities, which can be expected given the
complexity of these devices. We were surprised by the
large number of vulnerabilities in photo frames, which
are relatively simple devices.
A possible explanation is that vendors rushed to market
in order to grab market share with new features. Indeed, in
the Kodak photo frame, half the Web interface is protected
against XSS while the other half is completely vulnerable.
IP cameras and routers are more mature, and therefore
tend to have better security. Table 2 also shows that
even enterprise-grade devices such as switches, printers,
and LOM are vulnerable to a variety of attacks, which
is a concern as they are usually deployed into sensitive
environments such as server rooms.

Table 2: Vulnerability classes by device type.


4 Threat Model 


Our audit showed that embedded web management inter- 
faces pose a serious security threat and are currently one 
of the weakest links in home and office networks. In this 
section we formalize our attacker model and the security 
objectives that our framework aims at achieving. 


4.1 Attacker model 


In this paper, we are concerned with securing embedded
web interfaces from malicious attackers. Inspired by the
threat model of [6], we use the “web attacker” concept
with a slightly more powerful attacker, as we allow
the attacker to interact directly with the web framework
as in the active attacker model. Accordingly, our attacker
model is defined as follows: we assume an honest user
employs a standard web browser to view and interact with
the embedded web interface content. Our malicious web
attacker attempts to disrupt this interaction or steal sensitive
information such as a WPA key. Typically, a web
attacker can attempt to do this in two ways: by trying
to directly exploit a vulnerability in the web interface,
or by placing malicious content (e.g., JavaScript) in the
user’s browser and modifying the state of the browser,
interfering with the honest session. We allow the attacker
to attempt to directly attack the web framework in any
way he likes; in particular, we assume that the attacker
will attempt to DDoS the web server, find buffer overflow
exploits or brute-force the authentication. Finally, we also
assume that the attacker will be able to manipulate any
non-encrypted session to his advantage.


4.2 Security objectives 


Based on our audit evaluation and the attacker model 
described above we now formalize what security objec- 
tives our framework aims at achieving. These goals fall 
into four distinct umbrella objectives that cover all of the 
known attacks against a web interface. 


Enforcing access control. The first goal of our frame- 
work is to ensure that only the right principals have access 
to the right data. Access control needs to be
enforced at multiple levels. First, at the network level, our
framework needs to ensure that the web interface is only
available in the right physical or network location and to
the right clients. At the application level, it means that
the framework needs to ensure that every web resource
is properly protected and that the attacker cannot brute-force
user passwords. Finally, at the user level it also
means that the framework offers to the user the ability to 
declare whether a specific client is allowed to access a 
given web application. 
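The application-level requirement that passwords cannot be brute-forced can be sketched as a per-client failure counter with a temporary lockout. This is one common approach and is an assumption here; the text does not specify the mechanism. The class name, the threshold of three failures, and the 60-second lockout window are all hypothetical.

```javascript
// Tracks failed logins per client and locks the client out after too
// many failures. Threshold and lockout window are hypothetical values.
class LoginThrottle {
  constructor(maxFailures = 3, lockoutMs = 60000) {
    this.maxFailures = maxFailures;
    this.lockoutMs = lockoutMs;
    this.state = new Map(); // clientId -> { failures, lockedUntil }
  }

  // True while the client is still inside its lockout window.
  isLocked(clientId, now) {
    const s = this.state.get(clientId);
    return s !== undefined && now < s.lockedUntil;
  }

  // Counts a failed attempt and starts a lockout at the threshold.
  recordFailure(clientId, now) {
    const s = this.state.get(clientId) || { failures: 0, lockedUntil: 0 };
    s.failures += 1;
    if (s.failures >= this.maxFailures) {
      s.lockedUntil = now + this.lockoutMs;
      s.failures = 0;
    }
    this.state.set(clientId, s);
  }

  // A successful login clears the client's history.
  recordSuccess(clientId) {
    this.state.delete(clientId);
  }
}

const throttle = new LoginThrottle();
for (let i = 0; i < 3; i++) throttle.recordFailure('10.0.0.7', 1000);
const lockedNow = throttle.isLocked('10.0.0.7', 2000);
const lockedLater = throttle.isLocked('10.0.0.7', 1000 + 60000);
```

Passing the current time as a parameter keeps the sketch deterministic; a deployed version would read a clock and would also need to bound the size of the table.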


Protecting session state. Protecting session state ensures 
that once a session is established with the framework, 
only the authenticated user is accessing the session. At 
the network level, protecting the session state implies 
preventing man-in-the-middle attacks by enforcing the
use of SSL. At the HTTP level, protecting the session 
means protecting the session cookies from being leaked 
over HTTP (as in the Sidejacking attack) or being read 
via JavaScript (XSS). 
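As an illustration of the HTTP-level goal, the sketch below builds a Set-Cookie header value carrying the standard Secure and HttpOnly attributes, which respectively keep a cookie off plain-HTTP connections and hide it from JavaScript. The function name and the cookie name are hypothetical; the text does not show how the framework sets its cookies.

```javascript
// Builds a Set-Cookie header value for a session identifier.
// Secure: the browser only sends the cookie over HTTPS.
// HttpOnly: the cookie is not exposed to scripts via document.cookie.
function buildSessionCookie(name, value) {
  const attrs = [
    `${name}=${encodeURIComponent(value)}`,
    'Path=/',
    'Secure',
    'HttpOnly',
  ];
  return attrs.join('; ');
}

const header = buildSessionCookie('sid', 'abc 123');
```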


Deflecting direct web attacks. Deflecting direct web 
attacks requires that our framework is not vulnerable to 
buffer overflow or at least that the privileges gained in case 
of successful exploitation are limited. At the application 
level, the framework must be able to mitigate XSS [13]
and SQL injection attacks [21].


Preventing web browser attacks. In order to prevent
web browser attacks, the framework has to work with the 
browser to ensure that the attacker cannot include in a web 
site a piece of code (such as an iframe or JavaScript) that 
can abuse the trust relation between the browser and the 
web interface. These attacks are instances of the confused 
deputy problem [6]. They include CSRF and Clickjacking 
attacks. 


5 System Overview 


In this section we discuss the design principles behind our 
framework, provide an overview of how the framework 
works and describe how a web request is checked and 
processed. 


5.1 Design principles 


To address the threat model presented in the previous sec- 
tion, our framework is architected around the following 
four principles: 


Secure by default. The team in charge of building an 
embedded web interface is usually not security savvy 
and is likely to make mistakes. To cope with this lack 
of knowledge our framework is designed to be secure 
by default, which means that every security feature and 
check is in place and it is up to the developers to make 
them less restrictive or turn them off. For instance, our 
default CSP [14] (content security policy) only allows 
content from self, which means that no external content 
will be allowed to load from a page in the web interface. 
Similarly the framework uses whitelists for input filtering: 
by default only a restricted set of characters is allowed 
in URL parameters and POST variables, and it is up to 
the developer to relax this whitelist if needed. As a final 
example, the framework injects JavaScript frame-busting 
code and the X-Frame-Options header in all the pages
in order to prevent Clickjacking attacks. In the unlikely 
situation where the interface needs to be embedded in 
another webpage, the developer must turn the defense 
mechanism off. 
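The secure-by-default idea can be sketched as a set of defaults that stay in force unless the developer explicitly overrides them. The option names and the whitelist pattern below are hypothetical, and the header names are the ones standardized today, which may differ from what the framework emitted.

```javascript
// Defaults that are on unless the developer turns them off.
// Option names are hypothetical, not the framework's configuration keys.
const DEFAULTS = {
  csp: true,
  frameOptions: true,
  paramWhitelist: /^[A-Za-z0-9_.-]*$/,
};

// Returns the response headers implied by a (possibly relaxed) policy.
function securityHeaders(overrides = {}) {
  const policy = { ...DEFAULTS, ...overrides };
  const headers = {};
  if (policy.csp) headers['Content-Security-Policy'] = "default-src 'self'";
  if (policy.frameOptions) headers['X-Frame-Options'] = 'DENY';
  return headers;
}

// Checks every request parameter against the whitelist in force.
function paramsAllowed(params, overrides = {}) {
  const policy = { ...DEFAULTS, ...overrides };
  return Object.values(params).every((v) => policy.paramWhitelist.test(v));
}

const byDefault = securityHeaders();
const embeddable = securityHeaders({ frameOptions: false });
const okParams = paramsAllowed({ name: 'frame-1', ip: '10.0.0.2' });
const badParams = paramsAllowed({ name: '<script>' });
```

Because relaxing a protection requires an explicit override, a developer who does nothing gets the restrictive behavior, which is the property the principle is after.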


Defense in depth. Since there is no universal fix for many 
types of attacks, including XSS, CSRF, and Clickjacking, 
our framework follows the defense in depth principle and 
implements all the known techniques to try and mitigate 
each threat as much as possible. We perform filtering and 
security checks at input, during processing, and during 
output. 


Least privilege. Following the OKWS design [29], we 
implement the least privilege principle by leveraging the 
Android architecture. Each application and the framework
have separate user IDs and sets of permissions; this
guarantees that if the framework or one of the applications
is compromised, the attacker will not take complete
ownership of the data. For instance, by taking over the
framework one does not gain access to the phone contacts
list used by one of the applications: our framework only
has the network privilege. Note that the application developer
must modularize his or her application to fully
benefit from the least privilege design. Product features
that can significantly modify device functionality, such
as by executing a firmware upgrade, need to receive special
consideration as well, perhaps resulting in additional
backend checks performed in advance.


User consent. Our last design principle is “user consent
as permission”: we let the user make the final decisions
about key security policies. For example, when a new 
web client wants to access one of the phone web applica- 
tions, it is up to the user to allow this or not because only 
she knows if this request is legitimate. Similarly, when 
the user installs a new web application, she is asked if 
she wants to be prompted for approval each time a client 
connects to that application. Finally, at install time we 
also provide the user with a summary of the security fea- 
tures that have been disabled. The user can then decide if 
the presented security profile is acceptable or not. While 
users can generally not be relied on for ensuring system 
security, we implement the user consent principle in or- 
der to catch potential security issues that clearly defeat 
common sense. 


5.2 Server architecture 


As shown in Figure 3, the framework is composed of four 
blocks and architected like the iptables firewall with a 
series of security checks performed at input time, and 
another series during output. 


The Dispatcher is responsible for forwarding an HTTP
request to the desired application. The forwarding
decision is based on the unique port number assigned to
every application. Separating applications by port number
also makes it possible to configure data encryption
separately for each application. In addition to
forwarding, the Dispatcher is also responsible for
policy-based enforcement of security mechanisms.


The Configuration Manager handles per-application 
tuning of the security policies. When an application 
is first registered with the web server, all the security 
mechanisms are turned on by default. The administrator 
can then enable or disable individual mechanisms using 
the configuration interface. The resulting configuration 
is captured in a database and made available to the 
Dispatcher for policy enforcement.


Figure 3: Overview of the framework design showing
the interaction of the different web server components
(dispatcher, applications, and alert system) involved in
processing a client request.


The Alert System controls how the administrator is
notified of different events. For instance,
the administrator may want to be explicitly alerted for
every new client connection. The Alert System also 
handles notifications caused by malicious web requests 
as detected by the Dispatcher. Notifications can either 
be passive or active depending on whether they need 
approval from the administrator. 


Finally, the framework also provides an API for efficiently
implementing web applications. The core functionality
includes methods to handle HTTP requests and
generate the response. It also provides handlers with
built-in security mechanisms for content generation, such
as HTML components, CSS, JavaScript, and JSON. For
instance, the HTML, XML, and JSON handlers provide
parameterized functions that escape dynamic content
before it is added to the rendered page. In addition, the
framework provides methods that allow applications to
construct HTTPOnly or Secure cookies.


5.3 Request processing


As depicted in Figure 3, a new web request goes through 
a series of input security checks and processing, and is 
subsequently forwarded to the actual application. The 
response generated is subjected to another iteration of 
checks and processing before being sent to the client. 
If any check fails then the processing is aborted and a 
notification is sent via the Alert System. 


The pre-processing step performs two rounds of 
security checks. First, the origin of the request is 
compared to the client restriction policy in order to block 
queries coming from unwanted sources. Second, the 
HTTP query is validated through regular expression 
whitelists. The corresponding web application is then
identified (based on the port number), and the session and
CSRF token validation checks are performed.


After validation, the request is sent to the web appli- 
cation which generates a page using our framework and 
sends it back to the web server. Before reaching the 
network, the response is passed through post-processing 
security mechanisms like S-CSP and CSRF token gener- 
ation. This usually results in the inclusion of additional 
headers and modification of certain HTML elements. The 
result is then returned to the client. 
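
The input side of this flow can be pictured as an ordered chain of checks that aborts at the first failure. The following is a minimal sketch under stated assumptions, not the WebDroid implementation: the class InputChecks, the Request model, and the failure strings are hypothetical names introduced only for illustration, and only the two pre-processing rounds described above are modeled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Pattern;

/** Sketch of an ordered input-check chain; every name here is hypothetical. */
final class InputChecks {
    /** Minimal request model used only for this illustration. */
    static final class Request {
        final String sourceIp;
        final int port; // the port identifies the target application
        final Map<String, String> params;

        Request(String sourceIp, int port, Map<String, String> params) {
            this.sourceIp = sourceIp;
            this.port = port;
            this.params = params;
        }
    }

    /** A check returns null when the request passes, or a reason when it fails. */
    interface Check {
        String failureReason(Request r);
    }

    private final List<Check> chain = new ArrayList<>();

    InputChecks add(Check c) {
        chain.add(c);
        return this;
    }

    /** Runs the checks in order and aborts at the first failure. */
    String firstFailure(Request r) {
        for (Check c : chain) {
            String reason = c.failureReason(r);
            if (reason != null) {
                return reason; // a real server would also raise an alert here
            }
        }
        return null;
    }

    /** Builds the two rounds: client restriction first, then parameter whitelist. */
    static InputChecks preProcessing(Set<String> allowedClients, Pattern paramWhitelist) {
        return new InputChecks()
            .add(r -> allowedClients.contains(r.sourceIp) ? null : "client not allowed")
            .add(r -> {
                for (String value : r.params.values()) {
                    if (!paramWhitelist.matcher(value).matches()) {
                        return "parameter rejected by whitelist";
                    }
                }
                return null;
            });
    }
}
```

Ordering the cheap origin check before the regular-expression whitelist means that requests from unwanted sources are dropped before any parameter parsing takes place.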


6 Security Mechanisms 


A broad range of mechanisms and best practices have 
been developed over the last few years to counter the 
most severe web security problems. It is clear that no sin- 
gle technique or framework will make a web application 
secure. In addition, expecting developers to understand 
and deploy all of these mechanisms on their own is unreal- 
istic. Table 3 maps the mechanisms that we embed in our 
secure web server implementation against the threats they 
are designed to mitigate. We now describe each security 
mechanism and provide further references. Note that in 
many scenarios we depend on a correct browser imple- 
mentation for security capabilities. Wherever possible, 
we use additional mechanisms that can add security even 
if the browser is not up-to-date or compliant. 


HTTPOnly cookies. Many XSS vulnerabilities can be 
mitigated by reducing the amount of damage an injected 
script can inflict. HTTPOnly cookies [33] achieve this 
by restricting cookie values to be accessible by the server 
only, and not by any scripts running within a page. In 
practice, most cookies used in web application logic are 
inherently friendly to this concept, and this is why we 
have chosen to build it in. (HTTPOnly cookies are not
implemented by Android HttpCookie.)


Table 3: Threats and corresponding security mechanisms


Server-side input filtering. Even though filtering or
whitelisting of user input can fail if implemented
incorrectly [3, 2, 1], it is still very important to sanitize user
data before web pages are rendered with it. Input filtering
can prevent scripting exploits as well as SQL injections.
When applied to data coming from other embedded
services, input filtering can also prevent many XCS attacks.


CSP (Content Security Policy). Pages rendered by the
typical embedded web application have little need to
contact external web sites. Correspondingly, our server is
configured to offer restrictive CSP [14] directives to browsers,
limiting the impact of any injected code in the page.


S-CSP (Server-side Content Security Policy). For 
browsers that do not support CSP, we introduce Server- 
side CSP. While rendering a particular site, the server 
looks at the CSP directives present in the header (or the 
policy-uri) and modifies the HTML code accordingly.
Instead of standard input filtering, the changes are based on
the custom policies defined by the administrator, such as
valid hosts for the different HTML elements, use of inline
scripts, use of eval, and so on. Its novelty lies
in the fact that the resulting HTML page as received by
the browser automatically becomes CSP compliant. In
addition to filtering, S-CSP can also support reporting of
CSP violations via the 'report-uri' directive, which ordinarily
is not possible for incompatible browsers.


X-Frame-Options. Clickjacking is a serious emerging
threat which is best handled by preventing web site framing.
Since embedded web applications are usually not
designed with mash-up scenarios in mind, setting the
option to DENY is a good default configuration.


JavaScript frame-busting. Not all browsers support the 
X-Frame-Options header, and therefore our framework 
automatically includes frame-busting code in JavaScript. 
The particular piece of code we use is as simple as possi- 
ble and has been vetted for vulnerabilities typically found 
in such implementations [44]. 


Random anti-CSRF token. Cross-site request forgery 
is another web application attack which is easy to prevent, 
but often not addressed in embedded settings. Our frame- 
work automatically injects random challenge tokens in 
links and forms pointing back at the web application, and 
checks the tokens on page access [39]. 
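
As an illustration of this mechanism, the sketch below generates a random token, appends it to a link, and verifies it in constant time. It is a minimal sketch under assumptions: the class name CsrfTokens, the parameter name csrf, and the 128-bit token length are hypothetical choices, not details taken from the framework.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.SecureRandom;

/** Sketch of random anti-CSRF tokens; names and sizes are assumptions. */
final class CsrfTokens {
    private static final SecureRandom RNG = new SecureRandom();
    private static final char[] HEX = "0123456789abcdef".toCharArray();

    /** Returns a fresh 128-bit token encoded as 32 hexadecimal characters. */
    static String newToken() {
        byte[] raw = new byte[16];
        RNG.nextBytes(raw);
        StringBuilder sb = new StringBuilder(32);
        for (byte b : raw) {
            sb.append(HEX[(b >> 4) & 0xF]).append(HEX[b & 0xF]);
        }
        return sb.toString();
    }

    /** Appends the token to a link that points back at the application. */
    static String addToLink(String url, String token) {
        return url + (url.contains("?") ? "&" : "?") + "csrf=" + token;
    }

    /** Compares the session token with the supplied one in constant time. */
    static boolean matches(String expected, String supplied) {
        if (expected == null || supplied == null) {
            return false;
        }
        return MessageDigest.isEqual(
            expected.getBytes(StandardCharsets.UTF_8),
            supplied.getBytes(StandardCharsets.UTF_8));
    }
}
```

The token would be stored with the user session when a page is generated and compared against the request parameter on the next access; a request without a matching token is rejected.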


Origin header verification. Along with checking CSRF
tokens, we make sure that for requests that supply any
parameters (either POST or GET) and include the Origin [5]
or Referer header, the origin/referer values are
as expected. We do this as a basic measure to prevent
cross-site attacks. When the Referer header is available,
we also check for cross-application attacks, making sure
that each application is only accessed through its entry
pages.
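
A check of this kind could look as follows. This is a sketch under assumptions rather than the framework's code: the class OriginCheck and its method are hypothetical, and a missing header is treated as acceptable here because the text applies the check only to requests that include the header, leaving other requests to the CSRF token check.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Sketch of Origin/Referer verification; the class name is hypothetical. */
final class OriginCheck {
    /**
     * Returns true when the header is absent, or when it names the expected
     * scheme, host, and port. Works for both Origin and Referer values.
     */
    static boolean sameOrigin(String headerValue, String scheme, String host, int port) {
        if (headerValue == null) {
            return true; // header not supplied: rely on the CSRF token check
        }
        try {
            URI u = new URI(headerValue);
            return scheme.equalsIgnoreCase(u.getScheme())
                && host.equalsIgnoreCase(u.getHost())
                && port == effectivePort(u);
        } catch (URISyntaxException e) {
            return false; // malformed header values are rejected
        }
    }

    /** Fills in the default port when the URI does not state one. */
    private static int effectivePort(URI u) {
        if (u.getPort() != -1) {
            return u.getPort();
        }
        return "https".equalsIgnoreCase(u.getScheme()) ? 443 : 80;
    }
}
```

Because each application is served on its own port, comparing the port as well as the host also distinguishes requests that originate from a different application on the same device.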


SSL. Securing network communications often ends up 
being a low-priority item for application developers, and 
this is why our web server uses HTTPS exclusively by 
default, with a persistent self-signed certificate created 
during device initialization.


HSTS (HTTP Strict Transport Security) and Secure
cookies. In addition to supporting SSL out of the box,
our server implements the HSTS standard [22] and requests
that all incoming connections be over SSL, which
prevents several passive and active network attacks [23].
Moreover, browser cookies are created with the Secure
attribute, preventing the browser from leaking them to the
network in plaintext.
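
For concreteness, the header values involved can be built as in the sketch below. The class SecureHeaders, the Path=/ attribute, and the character restrictions are assumptions made for this example; only the Secure and HttpOnly cookie attributes and the HSTS max-age directive come from the mechanisms described in this section.

```java
/** Sketch of the response header values; the class name is hypothetical. */
final class SecureHeaders {
    /** Builds a Set-Cookie value carrying the Secure and HttpOnly attributes. */
    static String sessionCookie(String name, String value) {
        // A strict character set keeps separators and line breaks out of the header.
        if (!name.matches("[A-Za-z0-9_]+") || !value.matches("[A-Za-z0-9_-]+")) {
            throw new IllegalArgumentException("unexpected character in cookie name or value");
        }
        return name + "=" + value + "; Path=/; Secure; HttpOnly";
    }

    /** Builds a Strict-Transport-Security value for the given lifetime in seconds. */
    static String hsts(long maxAgeSeconds) {
        if (maxAgeSeconds < 0) {
            throw new IllegalArgumentException("max-age must not be negative");
        }
        return "max-age=" + maxAgeSeconds;
    }
}
```

A server would send the first value in a Set-Cookie header and the second in a Strict-Transport-Security header on every HTTPS response.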


Parametrized rendering and queries. Android already
supports parametrized SQLite queries [52] and we
encourage developers to make use of this facility. We have
also added the ability to parametrize dynamic HTML
rendering, in which case escaping of the output is performed
automatically.


URL scanning. Incoming HTTP requests are sani- 
tized by applying filtering similar to that offered by the 
URLScan tool in Microsoft IIS [34]. Our filter is config- 
ured to restrict both the URL and query parts of a request, 
while changes by the web application developer are al- 
lowed if necessary. URLScan is most useful in preventing 
web application vulnerabilities due to incorrect or incom- 
plete parsing of request data. 


Application-wide authentication, password policy, 
and password anti-bruteforcing. Recognizing that user 
authentication is often a weak spot for web applications, 
we have implemented user authentication as part of the 
web server, freeing the developers from the need to im- 
plement secure user session tracking. In addition, the 
password strength policy can be changed according to 
requirements, and a mechanism to prevent (or severely 
slow down) brute-force attacks is always enabled. 
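
One common way to slow down brute-force attempts is an exponentially growing delay per client. The sketch below shows that technique as an assumption: the text does not say which mechanism the framework uses, and the class LoginThrottle and its parameters are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch of login throttling with exponential back-off; names are hypothetical. */
final class LoginThrottle {
    private static final class State {
        int failures;
        long lockedUntilMs;
    }

    private final Map<String, State> byClient = new HashMap<>();
    private final long baseDelayMs;
    private final long maxDelayMs;

    LoginThrottle(long baseDelayMs, long maxDelayMs) {
        this.baseDelayMs = baseDelayMs;
        this.maxDelayMs = maxDelayMs;
    }

    /** Returns true when the client is currently allowed to try a password. */
    synchronized boolean mayAttempt(String client, long nowMs) {
        State s = byClient.get(client);
        return s == null || nowMs >= s.lockedUntilMs;
    }

    /** Doubles the waiting time after each failed attempt, up to maxDelayMs. */
    synchronized void recordFailure(String client, long nowMs) {
        State s = byClient.computeIfAbsent(client, k -> new State());
        s.failures++;
        int shift = Math.min(s.failures - 1, 20); // cap the shift to avoid overflow
        long delay = Math.min(maxDelayMs, baseDelayMs << shift);
        s.lockedUntilMs = nowMs + delay;
    }

    /** Clears the history once the client authenticates successfully. */
    synchronized void recordSuccess(String client) {
        byClient.remove(client);
    }
}
```

With a base delay of one second, the second consecutive failure locks the client out for two seconds, the third for four, and so on, which makes online guessing impractical while barely affecting a user who mistypes once.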


Network restrictions. Most embedded web servers have 
a relatively constrained network access profile: either the 
device should serve requests only when connected to a 
specific network or WiFi SSID, or the hosts requesting 
service might match a profile, such as a specific IP or 
MAC address. This feature, while easily accessible, cannot
be configured by default due to the differences in
individual application environments. 


Location restrictions. Similar to network restrictions, 
the server can be configured to operate only when the 
device is at specific physical locations, minimizing the 
opportunities for an attacker to access and potentially 
compromise the system. 


DDoS. While distributed denial-of-service (DDoS)
protection is difficult, we believe that much can be done to
mitigate such threats. For most applications, maintaining
local service is of top priority, and so we throttle HTTP
requests such that those coming from the local network
always have a guaranteed level of service. Of course, this
cannot prevent lower-level network DDoS attacks: these
have to be taken care of separately, outside of the web
server.


7 User interface 


This section briefly describes the user interface required 
for basic administration of the web server and security 
policy management. In the following description, we refer 
to the owner of the smart phone or embedded device as 
the Admin user. 


7.1 Configuration management


Figure 4: Main web server configuration interface.


This interface is used to control the server settings
across all the applications. As shown in Figure 4, it
provides the ability to disable each web application. It also
displays overall web server statistics, such as the number
of active applications and the number of active connections.


Web server logs. Logged events such as failures, new
connections, and configuration changes can be viewed
from the menu options.


Settings. From this interface, the Admin can override some
security features in order to enforce certain mechanisms
for all applications, irrespective of their individual
configuration.


Figure 5: Web application configuration interface,
allowing per-application customizations (secure settings
highlighted in green).


7.2 Configuration per web application 


This interface enables the Admin user to control
web application parameters such as the port number, the
application name, and its password, and to tune the security
policy for every application. As shown in Figure 5, it displays
the name, path, security level, and status information
along with the currently enabled security mechanisms.
Since all the mechanisms are turned on by default, policy 
administration is not strictly necessary. However, this 
allows flexibility in the framework that can be useful in 
special circumstances. For instance, the Admin user may 
wish to disable the heavy S-CSP mechanism in the case 
of a restricted set of trusted users. The different function- 
alities provided by the interface are described below. 


Alarm system configuration. Each new client connec- 
tion request can be monitored by setting the alarm noti- 
fication level to one of the three possibilities: Disabled, 
Passive, or Approval. Both Passive and Approval notifi- 
cations alert the administrator about the new connection. 
Approval mode has the additional feature of requiring the 
Admin user to grant access before proceeding. 


Network and location restriction. The web server can
restrict client connections based on network properties
(for example, serving over WiFi or 3G only) or based on the
current location, such as home or office.


Domain whitelist. The Admin can define a list of domains
that are allowed in the CSP policy by writing a
comma-separated list of domains/IP addresses. If this
field is empty, the web server will enforce the restrictive
'allow self' policy and block all other sources.


<WebServerConf>
  <WebApp>
    <path>com.android.websms</path>
    <Enabled>1</Enabled>
    <CSRF>1</CSRF>
    <HttpOnlyCookie>1</HttpOnlyCookie>
    <XFrame>1</XFrame>
  </WebApp>
</WebServerConf>


Figure 6: Web server configuration sample


IP whitelist. The Admin user can explicitly allow access 
for a specific set of trusted hosts by adding a comma- 
separated list of IP addresses. For a new connection re- 
quest, if the source IP is in this list then access is permitted 
regardless of the restrictions described above. 
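
The override behavior of the IP whitelist can be sketched as follows. The class IpWhitelist and the admit helper are hypothetical names for this illustration; exact string matching of addresses is an assumption, since the text does not say whether ranges or alternative notations are accepted.

```java
import java.util.HashSet;
import java.util.Set;

/** Sketch of the IP whitelist override; names are hypothetical. */
final class IpWhitelist {
    private final Set<String> allowed = new HashSet<>();

    /** Parses a comma-separated list such as "192.168.1.10, 192.168.1.11". */
    IpWhitelist(String commaSeparated) {
        if (commaSeparated == null) {
            return;
        }
        for (String part : commaSeparated.split(",")) {
            String ip = part.trim();
            if (!ip.isEmpty()) {
                allowed.add(ip);
            }
        }
    }

    boolean contains(String sourceIp) {
        return allowed.contains(sourceIp);
    }

    /** A whitelisted source is admitted even when other restrictions fail. */
    static boolean admit(IpWhitelist whitelist, String sourceIp, boolean otherRestrictionsPass) {
        return whitelist.contains(sourceIp) || otherRestrictionsPass;
    }
}
```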


7.3 Configuration without the UI 


For embedded devices without a display to access the
configuration interface, the web server can be configured
through an XML file present in the application package
as a raw resource. With this file, the web server
administrator can enforce security mechanisms for specific web
applications or disable all web applications that do not
meet certain requirements. The web server configuration
can also be changed after installation by modifying the
SQLite database on the device.
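
A file in the format of Figure 6 can be read with the standard Java DOM parser, as in the sketch below. This is an illustration under assumptions: the class WebServerConfReader is hypothetical, the value 1 is taken to mean "enabled" based on the sample, and how the actual server loads the file is not described in the text.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Sketch of a reader for the Figure 6 format; the class name is hypothetical. */
final class WebServerConfReader {
    /** Returns, for each application path, a map from setting name to flag. */
    static Map<String, Map<String, Boolean>> parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Refuse DOCTYPE declarations so the file cannot pull in external entities.
            factory.setFeature("http://apache.org/xml/features/disallow-doctype-decl", true);
            Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            Map<String, Map<String, Boolean>> apps = new LinkedHashMap<>();
            NodeList webApps = doc.getElementsByTagName("WebApp");
            for (int i = 0; i < webApps.getLength(); i++) {
                String path = null;
                Map<String, Boolean> flags = new LinkedHashMap<>();
                NodeList children = webApps.item(i).getChildNodes();
                for (int j = 0; j < children.getLength(); j++) {
                    Node n = children.item(j);
                    if (n.getNodeType() != Node.ELEMENT_NODE) {
                        continue; // skip whitespace between elements
                    }
                    String text = n.getTextContent().trim();
                    if (n.getNodeName().equals("path")) {
                        path = text;
                    } else {
                        flags.put(n.getNodeName(), text.equals("1")); // assumption: 1 = enabled
                    }
                }
                if (path != null) {
                    apps.put(path, flags);
                }
            }
            return apps;
        } catch (Exception e) {
            throw new IllegalArgumentException("invalid configuration file", e);
        }
    }
}
```

Treating any value other than 1 as disabled is a choice made for this sketch; a secure-by-default reader might instead treat a missing or malformed setting as enabled.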


8 Implementation 


In this section we describe how our system is imple- 
mented and how Android applications interact with 
it. Our system consists of two main components: the 
Dispatcher (a web server that processes and routes 
requests to applications) and our framework API that 
Android applications can access. 


The Dispatcher works as an Android background
service. As a starting point we used the Tornado
open-source web server, which we hardened and modified
to work with our framework. The web server
follows the least privilege principle and runs with
the minimal permission set needed to handle HTTP
communications: android.permission.INTERNET.
To be allowed to expose a web interface, an application
requests a new permission that we created,
called com.android.webserver.WEB_APPLICATION.
This novel permission is more restrictive than
android.permission.INTERNET and only allows the
application to serve web requests via the dispatcher.


mountWebContent("websms", Home.class);
mountWebContent("websms/send", SendSMS.class);
mountWebContent("websms/view", SMSHistory.class);
mountWebContent("websms/theme.css", RawRessource.class,
                RawRessource.CSS, R.raw.hello);


Figure 7: WebSMS code used to declare the exposed web
interface.


At launch time the Dispatcher browses the list of
installed applications for new ones requesting the web
application permission. By retrieving the ContentProvider
associated with the framework, it queries the security
configuration. Following the consent as permission principle,
we prompt the user every time a new web application
wants to register. When an application registers the same URL
path as another one, the registration is discarded and a
warning about a possibly malicious application is displayed
to the user.

The framework API is a Java library that handles
communications between the web server and the web
application (which run as separate processes). It also provides
a set of classes that help generate web content. As in
many modern web frameworks (e.g., Rails), every
web page needs to register its web path through a function
call; in our case this function is mountWebContent.
This function binds a path to a Java class entry point. For
example, our WebSMS web application registers 4 web
pages: 3 HTML pages and 1 CSS stylesheet (Figure 7).
Note the use of RawRessource.class, which allows
developers to expose raw data such as CSS files directly
to the web. Our framework provides a set of
classes to help build HTML pages and handle requests
for other resources such as pictures, CSS stylesheets, or
JavaScript libraries. The Java classes Home, SendSMS,
and SMSHistory extend the framework class HTMLPage,
which provides various methods to add dynamic
content to the pages. In particular, the HTMLPage class
has the method appendHTMLContent(content,
String[] vars), which programmatically appends
content to the page. Text variables are represented
by $ and are substituted by the corresponding var string
after it is filtered to prevent XSS. While authors can
bypass the filtering process if they want, it is in place
by default. Similarly, the HTMLPage class ensures that
the data passed to the application is properly sanitized
and that parametrized SQL queries are used in order to
prevent SQL injection.
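
The $ substitution can be illustrated with a small stand-alone sketch. It is not the HTMLPage implementation: the class SafeHtml, the render method, and the exact set of escaped characters are assumptions, and the escaping shown is only adequate for HTML text and quoted attribute values.

```java
/** Sketch of placeholder substitution with escaping; names are hypothetical. */
final class SafeHtml {
    /** Escapes the characters that are significant in HTML text and attributes. */
    static String escape(String s) {
        StringBuilder out = new StringBuilder(s.length() + 16);
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                case '\'': out.append("&#39;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    /** Replaces each $ in the template with the next variable, escaped. */
    static String render(String template, String[] vars) {
        StringBuilder out = new StringBuilder();
        int next = 0;
        for (int i = 0; i < template.length(); i++) {
            char c = template.charAt(i);
            if (c == '$') {
                if (next >= vars.length) {
                    throw new IllegalArgumentException("not enough variables");
                }
                out.append(escape(vars[next++]));
            } else {
                out.append(c);
            }
        }
        if (next != vars.length) {
            throw new IllegalArgumentException("unused variables");
        }
        return out.toString();
    }
}
```

Because the template is written by the developer and only the variables come from users, markup in the template is preserved while markup in the variables is neutralized.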


When an HTTP request is received, it goes through all 
pre-processing security mechanisms and is dispatched to 
the corresponding web application. The framework API 
embeds an Android ContentProvider used by the web 
server to query pages. HTTP headers, body and security 
tokens are added to the query and then transmitted to the 
web application. Using the framework API, the web page
is built and sent back as the answer to the query. The answer is
finally checked by all post-processing security mechanisms
and sent back to the web client.


9 Case Studies 


In this section we present two case studies that demon- 
strate how our framework effectively mitigates web 
vulnerabilities. We describe the applications we built, 
their attack surface, how the framework protects them, 
and finally show that when using off-the-shelf security 
scanners the framework is indeed able to mitigate the 
vulnerabilities found in the apps. 


To study the effectiveness of our system we built
two sample applications that take advantage of the
phone’s capabilities to provide useful services: the first
one, WebSMS, is used for reading and sending SMS from
the browser; the second one, WebMedia, provides a
convenient web interface to browse and display the photos
and videos stored on the smartphone. We argue that these
two applications, while limited, are good case studies
of what developers might want to build in order to leverage
a device’s capabilities in the form of web applications.


9.1 Applications 


WebSMS. When the application is loaded in a client browser,
the user can choose to view the current SMS inbox or send
a new message. For the second choice, the application
displays a list of contacts fetched from the phone’s
directory along with a search box. Clicking on a particular
contact allows the user to send an SMS directly from the
browser. The SMS content is sent by the browser to the
application via a POST request that contains the contact ID.


WebMedia. This application displays a gallery of photos
and videos stored on the Android device (Figure 8). When
a thumbnail is clicked, a full size view of the media file
is displayed. The application provides a convenient way
to display photos and videos to friends and family on a
big screen. In addition, this application enables seamless
sharing of content with trusted users (friends or family).


Figure 8: The WebMedia embedded web application.


9.2 Attack surfaces 


Without framework support, the web applications suffer
from multiple vulnerabilities. In the WebSMS application,
the contact search can be a vector for reflected XSS or
SQL injection. Also, the ability to send messages and
view their contents afterward can lead to stored XSS on
both the sending and the receiving phone. The WebMedia
application is vulnerable to CSRF attacks as well. An
XSS attack allows the attacker to steal private information
such as the contact list or the contents of sent and received
SMS. A CSRF attack can be conducted to send SMS on behalf of
the user, which can lead to embarrassing situations or
financial loss. In extreme cases, if the phone is used as
a trusted device to authorize sensitive operations such as
bank transfers, then the combination of XSS and CSRF
attacks will allow a malicious user to bypass this security
mechanism and conduct fraudulent operations.


9.3 Security evaluation 


In order to evaluate whether our framework is able to
mitigate the attacks against our vulnerable applications,
we ran the web scanners Skipfish and Nexpose
against our applications with the framework defense
mechanisms off and then on. When the framework
defenses are turned off, both Skipfish and Nexpose
detect reflected XSS and stored XSS vulnerabilities in
the WebSMS application. When the framework defenses
are turned on, no vulnerabilities are reported. Note that
neither scanner reported the CSRF vulnerabilities.


Figure 9: Average number of requests per second with and
without security features enabled.


This limited experiment shows that our framework can
help effectively and transparently mitigate vulnerabilities
that may exist in embedded web interfaces, even though
it cannot completely replace good coding practices and
careful code review.


9.4 Performance evaluation 


While, as stated earlier, performance should not be the
focus of a mobile web framework, we still ran a basic
performance evaluation using the Apache benchmark
tool to evaluate the impact of enabling security features
on WebDroid performance. To reflect real-world usage as
accurately as possible, we ran these benchmarks over
WiFi with WebDroid on a standard HTC Desire phone
with Android 2.3. We were not able to test over 3G because
the IP addresses are not routable.


Figure 9 reports WebDroid performance in terms of requests
per second for the WebSMS application as the number of
simultaneous connections increases. Figure 10 depicts how
fast WebDroid is able to process each request as the number
of simultaneous connections increases. As visible in the
diagrams, WebDroid takes a 10% to 30% performance hit when
the security features are turned on, depending on the number
of simultaneous connections. On average, WebDroid performance
takes a 20% hit when the security features are enabled.
While this performance hit might not be acceptable for a
regular website, we argue that it is acceptable for an
embedded interface: even with 128 simultaneous
connections, WebDroid is able to serve every request in
less than 80 ms, which is below the optimal user
tolerance time of 100 ms [37].


Figure 10: Average time to process a request with and
without security features enabled.


10 Related Work 


Browser defenses. Mozilla Foundation’s Content Secu- 
rity Policy (CSP) [14] proposal allows a site to specify 
restrictions on content served from the site, including 
which external resources the content can load. The CSP 
policy is specified as an HTTP header in each HTTP 
response. For example, the CSP header 


X-Content-Security-Policy: allow self 


prevents the content from loading any external resources 
or executing inline scripts. Replacing “allow self” 
with “allow whitelist” allows external resources from 
the given whitelist. Another system, SiteFirewall [9], 
takes a similar approach but also allows persistent 
browser-side policy storage (via cookies or other, more 
secure objects). SiteFirewall is capable of blocking 
some types of XCS attacks from being completed. 
The system uses a browser extension that acts as a 
firewall between vulnerable, internal web sites, and 
those accessed by the user on the open Internet. A third 
proposal called SOMA [38] implements a mutual consent 
policy on cross-origin links. That is, both the embedding 
and the embedded content must agree to the action 
being initiated. As with CSP, SOMA is implemented 
as a content-specific policy rather than a global site 
policy. Finally, Content Restrictions [32] is another 
approach to defining content control policies on web sites. 


Frameworks. Generic web frameworks, such as Ruby
on Rails [41] and Django, implement numerous features
such as built-in CSRF defenses that help developers to
build secure web interfaces more easily. However, this
kind of generic framework is very heavy and therefore
not suitable for use in embedded devices. We are
not currently aware of any framework specially designed
for embedded devices. Additionally, while designed with
security in mind, these frameworks do not make secure
web application design intuitive for the developer.
In contrast, we strive for a secure-by-default system
where a developer has to do little if anything in order to
build a secure web application.


Web servers. At the process level, flow control
enforcement such as that presented in HiStar [54],
Asbestos [11], and Flume [30] can be used to achieve some
of our goals, such as document sanitization. The Android
OS [15] capability model can also be extended to enforce
network restrictions. As far as we know, none of the
lightweight web servers like Tornado [12] were built with
the objective of enforcing security principles. Previous
work on security-centric web servers such as [29] was
only designed to mitigate low-level attacks by enforcing
privilege separation. None of them offered a framework
to mitigate web vulnerabilities.


Other related work. The log injection attack, a simple 
form of XCS, has been known for several years [47], 
most notably in the context of web servers resolving 
client hostnames. Recently, CSRF and XSS attacks have 
attracted much attention, including work on various 
defense techniques [6]. NAS security has been a topic for 
discussion since the early days of networked storage [10]. 


IP telephony security has also been scrutinized. However,
this has only been done for specific protocols, not
for complete systems [48]. Most other work in web
security [13, 24, 4, 17, 25, 28, 43, 32, 36, 40, 53, 46] has
focused on web servers on the open Internet, as opposed
to devices on private intranets, which are the topic of this
work.


11 Conclusion 


We present WebDroid, the first web application framework
that is explicitly designed for embedded applications, with
a particular emphasis on secure web application design.
We motivate our work with extensive results from audits
carried out over the last two years on a broad range of
embedded web servers. We evaluate WebDroid performance
and show that despite the fact that performance takes
a 20% hit when all the security features are activated,
WebDroid remains sufficiently fast for its purpose. Finally,
as a case study we build two sample web applications.


ZOZZLE: Fast and Precise In-Browser JavaScript Malware Detection


Univ. of Mass., Amherst


Abstract 


JavaScript malware-based attacks account for a large 
fraction of successful mass-scale exploitation happening 
today. Attackers like JavaScript-based attacks because 
they can be mounted against an unsuspecting user visit- 
ing a seemingly innocent web page. While several tech- 
niques for addressing these types of exploits have been 
proposed, in-browser adoption has been slow, in part be- 
cause of the performance overhead these methods incur. 

In this paper, we propose ZOZZLE, a low-overhead so- 
lution for detecting and preventing JavaScript malware 
that is fast enough to be deployed in the browser. 

Our approach uses Bayesian classification of hierarchical
features of the JavaScript abstract syntax tree
to identify syntax elements that are highly predictive
of malware. Our experimental evaluation shows that
ZOZZLE is able to detect JavaScript malware effectively
through mostly static code analysis. ZOZZLE has an
extremely low false positive rate of 0.0003%, which is
less than one in a quarter million. Despite this high
accuracy, the ZOZZLE classifier is fast, with a throughput of
over one megabyte of JavaScript code per second.


1 Introduction 


In the last several years, we have seen mass-scale
exploitation of memory-based vulnerabilities migrate
towards heap spraying attacks. This is because more
traditional vulnerabilities such as stack- and heap-based
buffer overruns, while still present, are now often
mitigated by compiler techniques such as StackGuard [7]
or operating system mechanisms such as NX/DEP and
ASLR [12]. While several heap spraying solutions have
been proposed [8, 9, 21], arguably, none are lightweight
enough to be integrated into a commercial browser.

However, a browser-based detection technique is still
attractive for several reasons. Offline scanning is often
used in modern browsers to check whether a particular
site the user visits is benign and to warn the user otherwise.
However, because it takes a while to scan a very
large number of URLs that are in the observable web,
some URLs will simply be missed by the scan. Offline
scanning is also not as effective against transient malware
that appears and disappears frequently.

ZOZZLE is a mostly static JavaScript malware detector
that is fast enough to be used in a browser. While
its analysis is entirely static, ZOZZLE has a runtime
component: to address the issue of JavaScript obfuscation,
ZOZZLE is integrated with the browser’s JavaScript engine
to collect and process JavaScript code that is created
at runtime. Note that fully static analysis is difficult
because JavaScript code obfuscation and runtime code
generation are so common in both benign and malicious
code.


Challenges: Any technical solution to the problem out- 
lined above requires overcoming the following chal- 
lenges: 


• performance: detection is often too slow to be deployed
in a mainstream browser;

• obfuscated malware: because both benign and malicious
JavaScript code is frequently obfuscated,
purely static detection is generally ineffective;

• low false positive rates: given the number of URLs
on the web, while false positive rates of 5% are
considered acceptable for, say, static analysis tools,
rates even 100 times lower are not acceptable for
in-browser detection;

• malware transience: transient malware compromises
the effectiveness of offline-only scanning.


Because it works in a browser, ZOZZLE uses the Java- 
Script runtime engine to expose attempts to obscure mal- 
ware via uses of eval, document.write, etc. by hooking 
the runtime and analyzing the JavaScript just before it 
is executed. We pass this unfolded JavaScript to a static 
classifier that is trained using features of the JavaScript 
AST (abstract syntax tree). We train the classifier with a
collection of labeled malware samples collected with the 
NOZZLE dynamic heap-spraying detector [21]. Related 
work [4, 6, 14, 22] also classifies JavaScript malware us- 
ing a combination of static and dynamic features, but re- 
lies on emulation to deobfuscate the code and to observe 
dynamic features. Because we avoid emulation, our anal- 
ysis is faster and, as we show, often superior in accuracy. 


Contributions: this paper makes these contributions: 


• Mostly static malware detection. We propose
ZOZZLE, a highly precise, lightweight, mostly static
JavaScript malware detector. ZOZZLE is based on
extensive experience analyzing thousands of real
malware sites found while performing dynamic
crawling of millions of URLs using the NOZZLE
runtime detector.

• AST-based detection. We describe an AST-based
technique that involves the use of hierarchical
(context-sensitive) features for detecting malicious
JavaScript code. This context-sensitive approach
provides increased precision in comparison to naive
text-based classification.

• Fast classification. Because fast scanning is key to
in-browser adoption, we present fast multi-feature
matching algorithms that scale to hundreds or even
thousands of features.

• Evaluation. We evaluate ZOZZLE in terms of
performance and malware detection rates, both false
positives and false negatives. ZOZZLE has an
extremely low false positive rate of 0.0003%, which is
less than one in a quarter million, comparable to five
commercial anti-virus products we tested against.
To obtain these numbers, we tested ZOZZLE against
a collection of over 1.2 million benign JavaScript
samples. Despite this high accuracy, the classifier is
very fast, with a throughput of over one megabyte
of JavaScript code per second.


Classifier-based tools are susceptible to being circum- 
vented by an attacker who knows the inner workings of 
the tool and is familiar with the list of features being 
used; however, our preliminary experience with ZOZZLE
suggests that it is capable of detecting many thousands of 
malicious sites daily in the wild. We consider the issue 
of evasion in Section 6. 


Paper Organization: The rest of the paper is organized 
as follows. Section 2 gives some background informa- 
tion on JavaScript exploits and their detection and sum- 
marizes our experience of performing offline scanning 
with NOZZLE on a large scale. Section 3 describes the 
implementation of our analysis. Section 4 describes our 
experimental methodology. Section 5 describes our
experimental evaluation. Section 6 provides a discussion
of the limitations and deployment concerns for ZOZZLE.
Section 7 discusses related work, and, finally, Section 8
concludes.

<html>
<body>
<button id="butid" onclick="trigger();"
style="display:none"/>
<script>
// Shellcode
var shellcode=unescape('%u9090%u9090%u9090%u9090...');
bigblock=unescape('%u0D0D%u0D0D');
headersize=20;
shellcodesize=headersize+shellcode.length;
while(bigblock.length<shellcodesize){bigblock+=bigblock;}
heapshell=bigblock.substring(0,shellcodesize);
nopsled=bigblock.substring(0,
bigblock.length-shellcodesize);
while(nopsled.length+shellcodesize<0x25000){
nopsled=nopsled+nopsled+heapshell
}
// Spray
var spray=new Array();
for(i=0;i<500;i++){spray[i]=nopsled+shellcode;}
// Trigger
function trigger(){
var varbdy = document.createElement('body');
varbdy.addBehavior('#default#userData');
document.appendChild(varbdy);
try {
for (iter=0; iter<10; iter++) {
varbdy.setAttribute('s',window);
} catch(e){ }
window.status+='';
}
document.getElementById('butid').onclick();
}
</script>
</body>
</html>


Figure 1: Heap spraying attack example.

Appendices are organized as follows. Appendix A 
discusses some of the hand-analyzed malware samples. 
Appendix B explores tuning ZOZZLE for better precision.
Appendix C shows examples of non-heap spray malware 
and also anti-virus false positives. 


2 Background 


This section gives overall background on JavaScript- 
based malware, focusing specifically on heap spraying 
attacks. 


2.1 JavaScript Malware Background 


Figure 1 shows an example of real JavaScript malware 
that performs a heap spray. Such malware consists of 
three relatively independent parts. The shellcode is the 
portion of executable machine code that will be placed 
on the browser heap when the exploit is executed. It is 
typical to precede the shellcode with a block of NOP instructions
(a so-called NOP sled). The sled is often quite
large compared to the size of the subsequent shellcode,
so that a random jump into the process address space is 
likely to hit the NOP sled and slide down to the start of 
the shellcode. The next part is the spray, which allocates 
many copies of the NOP sled/shellcode in the browser 
heap. In JavaScript, this is easily accomplished using an 
array of strings. Spraying of this sort can be used to defeat
address space layout randomization (ASLR) protection
in the operating system. The last part of the exploit
triggers a vulnerability in the browser; in this case, the
exploit targets a well-known memory corruption flaw in
Internet Explorer 6 involving the function
addBehavior.

Note that the example in Figure 1 is entirely unobfuscated,
with the attacker not even bothering to rename
variables such as shellcode, nopsled, and spray to make
the attack harder to spot. In practice, many attacks are
obfuscated prior to deployment, either by hand, or using 
one of many available obfuscation kits [11]. To avoid de- 
tection, the primary technique used by obfuscation tools 
is to use eval unfolding, i.e. self-generating code that 
uses the eval construct in JavaScript to produce more 
code to run. 


2.2 Characterizing Malicious JavaScript 


ZOZZLE training is based on results collected with the 
NOZZLE heap spraying detector. To gather the data we 
use to train the ZOZZLE classifier and evaluate it, we em- 
ployed a web crawler to visit many randomly selected 
URLs and process them with NOZZLE to detect if mal- 
ware was present. 

Once we determined that JavaScript was malicious, we
invested considerable effort in examining the code by
hand and categorizing it in various ways. One of the
insights we gleaned from this process is that once unfolded,
most malware does not have that much variety, following 
the traditional long tail pattern. We discuss some of the 
hand-analyzed samples in Appendix A. 

Any offline malware detection scheme must deal with 
the issues of transience and cloaking. Transient mali- 
cious URLs go offline or become benign after some pe- 
riod of time, and cloaking is when an attack hides itself 
from a particular user agent, IP address range, or from 
users who have visited the page before. While we tried 
to minimize these effects in practice by scanning from a 
wider range of IP addresses, in general, these issues are 
difficult to fully address. 

Figure 2 summarizes information about malware transience.
To compute the transience of malicious sites, we
re-scan the set of URLs detected by NOZZLE on the previous
day. This procedure is repeated for three weeks (21
days). The set of all discovered malicious URLs was
re-scanned on each day of this three week period. This
means that only the URLs discovered on day one were 
re-scanned 21 days later. The URLs discovered on day 
one happened to have a lower transience rate than other 
days, so there is a slight upward slope toward the end of 
the graph. 

Figure 2: Transience of detected malicious URLs after several days.
The number of days is shown on the x axis, the percentage of remaining
malware is shown on the y axis.

Figure 3: Unfolding tree: an example. Rectangles are documents,
and circles are JavaScript contexts. Gray circles are benign, black are
malicious, and dashed are “co-conspirators” that participate in
deobfuscation. Edges are labeled with the method by which the context or
document was reached. The actual page contains 10 different exploits
using the same obfuscation.

Any offline scanning technique will have difficulty
keeping up with malware exhibiting such a high rate of
transience: nearly 20% of malicious URLs were gone after
a single day. We believe that in-browser detection
is desirable, in order to be able to detect new malware 
before it has a chance to affect the user regardless of 
whether the URL being visited has been scanned before. 


2.3 Dynamic Malware Structure 


Figure 4: Distribution of context counts for malware and benign code.

One of the core issues that needs to be addressed when
talking about JavaScript malware is the issue of obfuscation.
In order to avoid detection, malware writers resort
to various forms of JavaScript code obfuscation, some of
which is done by hand, some with the help of the many
available obfuscation toolkits [11]. While many approaches
to code obfuscation exist, in our experience we see eval 
unfolding as the most commonly used. The idea is to use 
the eval language feature to generate code at runtime in 
a way that makes the original code difficult to pattern- 
match. Often, this form of code unfolding is used repeat- 
edly, so that many levels of code are produced before the 
final, malicious version emerges. 


Example 1 Figure 3 illustrates the process of code un- 
folding using a specific malware sample obtained from 
a web site http://es.doowon.ac.kr. At the time of 
detection, this malicious URL flagged by NOZZLE con- 
tained 10 distinct exploits, which is not uncommon for 
malware writers, who tend to “over-provision” their exploits:
to increase the chances of successful exploitation,
they may include multiple exploits within the same page. 
Each exploit in our example is pulled in with an <iframe> 
tag. 

Each of these exploits is packaged in a similar fashion. 
The leftmost context is the result of an eval in the body of 
the page that defines a function. Another eval call from 
the body of the page uses the newly-defined function to 
define another new function. Finally, this function and 
another eval call from the body expose the actual exploit.
Surprisingly, this page also pulls in a set of benign
contexts, consisting of page trackers, JavaScript frame- 
works, and site-specific code. 

Note, however, that the presence of eval unfolding 
does not provide a reliable indication of malicious in- 
tent. There are plenty of perfectly benign pages that also 
perform some form of code obfuscation, for instance, as 
a weak form of copy protection to avoid code piracy. 
Many commonly used JavaScript library frameworks do 
the same, often to save space through client-side code 
generation. 














[Figure 5 diagram. Recoverable labels: JS Engine; file.js/eval context;
extraction and labeling; benign contexts; malicious contexts; initial
features; filtering; feature selection; predictive features; training;
features + weights; Bayesian classifier; JavaScript file; classification.]

Figure 5: ZOZZLE training illustrated.


We instrumented the ZOZZLE deobfuscator to collect
information about which code context leads to other
code contexts, allowing us to collect information about
the number of code contexts created and the unfolding
depth. Figure 4 shows the distribution of JavaScript context
counts for benign and malicious URLs. The majority
of URLs have only several JavaScript code contexts;
however, many have 50 or more, created
through either <iframe> or <script> inclusion or eval
unfolding. Some pages, however, may have as many as 200
code contexts. In other words, a great deal of dynamic 
unfolding needs to take place before these contexts will 
“emerge” and will be available for analysis. 

It is clear from the graph in Figure 4 that, contrary to 
what might have been thought, the number of contexts is 
not a good indicator of a malicious site. Context counts 
were calculated for all malicious URLs from a week of 
scanning with NOZZLE and a random sample of benign
URLs over the same period. 


3 Implementation 


In this section, we discuss the details of the ZOZZLE im- 
plementation. 


3.1 Overview 


Much of ZOZZLE’s design and implementation has in retrospect
been informed by our experience with reverse
engineering and analyzing real malware found by NOZZLE.
Figure 5 illustrates the major parts of the ZOZZLE
architecture. At a high level, the process proceeds in three
stages: JavaScript context collection and labeling as benign
or malicious, feature extraction and training of a
naive Bayesian classifier, and finally, applying the classifier
to a new JavaScript context to determine if it is benign
or malicious. In the following sections, we discuss
the details of each of these stages in turn.


3.2 Training Data Extraction and Labeling 


ZOZZLE makes use of a statistical classifier to efficiently 
identify malicious JavaScript. The classifier needs train- 
ing data to accurately classify JavaScript source, and 
we describe the process we use to get that training data 
here. We start by augmenting the JavaScript engine in 
a browser with a “deobfuscator” that extracts and col- 
lects individual fragments of JavaScript. As discussed 
above, exploits are frequently buried under multiple levels
of JavaScript eval. Unlike NOZZLE, which observes
the behavior of running JavaScript code, ZOZZLE must 
be run on an unobfuscated exploit to reliably detect ma- 
licious code. 

While detection on obfuscated code may be possible, 
examining a fully unpacked exploit is most likely to re- 
sult in accurate detection. Rather than attempt to deci- 
pher obfuscation techniques, we leverage the simple fact 
that an exploit must unpack itself to run. 

Our experiments presented in this paper involved 
instrumenting the Internet Explorer browser, but we 
could have used a different browser such as Firefox or 
Chrome instead. Using the Detours binary instrumenta- 
tion library [13], we were able to intercept calls to the 
Compile function in the JavaScript engine located in the 
jscript.dll library. This function is invoked when eval
is called and whenever new code is included with an
<iframe> or <script> tag. This allows us to observe JavaScript
code at each level of its unpacking just before it is
executed by the engine. We refer to each piece of Java- 
Script code passed to the Compile function as a code con- 
text. For purposes of evaluation, we write out each con- 
text to disk for post-processing. In a browser-based im- 
plementation, context assessment would happen on the 
fly. 


3.3 Feature Extraction


Once we have labeled JavaScript contexts, we need to 
extract features from them that are predictive of mali- 
cious or benign intent. For ZOZZLE, we create features 
based on the hierarchical structure of the JavaScript ab- 
stract syntax tree (AST). Specifically, a feature consists 
of two parts: a context in which it appears (such as a 
loop, conditional, try/catch block, etc.) and the text (or 
some substring) of the AST node. For a given JavaScript 
context, we only track whether a feature appears or not, 
and not the number of occurrences. To efficiently extract
features from the AST, we traverse the tree from the
root, pushing AST contexts onto a stack as we descend 
and popping them as we ascend. 

To limit the possible number of features, we only ex- 
tract features from specific nodes of the AST: expres- 
sions and variable declarations. At each of the expression 
and variable declarations nodes, a new feature record is 
added to that script’s feature set. 
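
The traversal described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the tuple-based toy AST, the node kinds, and the names CONTEXT_KINDS, FEATURE_KINDS, and extract_features are all hypothetical.

```python
# Toy AST node: (kind, text, children). The node kinds are illustrative
# assumptions, not the representation used by a real JavaScript engine.
CONTEXT_KINDS = {"function", "loop", "if", "try"}   # nodes that open a context
FEATURE_KINDS = {"expression", "var_decl"}          # nodes that yield features


def extract_features(node, context_stack=None, features=None):
    """Collect (context, text) pairs; only presence is tracked, so use a set."""
    if context_stack is None:
        context_stack = ["script"]
    if features is None:
        features = set()
    kind, text, children = node
    if kind in FEATURE_KINDS:
        features.add((context_stack[-1], text))
    pushed = kind in CONTEXT_KINDS
    if pushed:
        context_stack.append(kind)      # push the context while descending
    for child in children:
        extract_features(child, context_stack, features)
    if pushed:
        context_stack.pop()             # pop it again while ascending
    return features


toy_ast = ("script", "", [
    ("var_decl", "shellcode", []),
    ("loop", "", [
        ("expression", "spray[i]=nopsled+shellcode", []),
        ("expression", "spray[i]=nopsled+shellcode", []),
    ]),
    ("try", "", [("expression", "unescape(x)", [])]),
])
feats = extract_features(toy_ast)
```

The duplicated loop expression contributes a single feature, matching the rule that only presence, not the number of occurrences, is recorded.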

If we use the text of every AST expression or variable 
declaration observed in the training set as a feature for 
the classifier, it will perform poorly. This is because most 
of these features are not informative (that is, they are not 
correlated with either benign or malicious training set). 
To improve classifier performance, we instead pre-select 
features from the training set using the χ² statistic to
identify those features that are useful for classification. 
A pre-selected feature is added to the script’s feature set 
if its text is a substring of the current AST node and the 
contexts are equal. The method we used to select these 
features is described in the following section. 


3.4 Feature Selection 


As illustrated in Figure 5, after creating an initial feature
set, ZOZZLE performs a filtering pass to select those
features that are likely to be most predictive. For this
purpose, we used the χ² test to check for correlation.
We include only those features whose presence is
correlated with the categorization of the script (benign or
malicious). The χ² test (for one degree of freedom) is
described below:


A = malicious contexts with feature
B = benign contexts with feature
C = malicious contexts without feature
D = benign contexts without feature


$$\chi^2 = \frac{(A \times D - C \times B)^2}{(A + C) \times (B + D) \times (A + B) \times (C + D)}$$


We selected features with χ² > 10.83, which corresponds
to a 99.9% confidence that the two values (feature
presence and script classification) are not independent.
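
As a worked illustration of the filtering step, the following Python sketch computes the statistic from the four counts A, B, C, and D defined above. One assumption should be stated loudly: the textbook χ² statistic for a 2×2 table multiplies the ratio by the total count N = A + B + C + D, and without that factor the value can never exceed 1, so the 10.83 threshold could not be met. The sketch therefore includes N. The function names are hypothetical.

```python
CHI2_THRESHOLD = 10.83  # critical value for one degree of freedom at p = 0.001


def chi_squared(a, b, c, d):
    """Chi-squared statistic for a 2x2 contingency table.

    a: malicious contexts with the feature
    b: benign contexts with the feature
    c: malicious contexts without the feature
    d: benign contexts without the feature

    Assumption: the factor n = a + b + c + d of the textbook statistic
    is included here, although the formula as printed omits it.
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: the feature carries no information
    return n * (a * d - c * b) ** 2 / denom


def is_predictive(a, b, c, d):
    """Keep a feature only if it clears the 99.9% confidence threshold."""
    return chi_squared(a, b, c, d) > CHI2_THRESHOLD
```

For example, a feature seen in 90 of 100 malicious contexts and 10 of 100 benign contexts scores 128 and is kept, while a feature seen equally often in both classes scores 0 and is discarded.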


3.5 Classifier Training 


ZOZZLE uses a naive Bayesian classifier, one of the sim- 
plest statistical classifiers available. When using naive 
Bayes, all features are assumed to be statistically inde- 
pendent. While this assumption is likely incorrect, the 
independence assumption has yielded good results in the 
past. Because of its simplicity, this classifier is efficient
to train and run. 

The probability assigned to label $L_i$ for a code fragment
containing features $F_1, \ldots, F_n$ may be computed using
Bayes rule as follows:


$$P(L_i \mid F_1, \ldots, F_n) = \frac{P(L_i)\, P(F_1, \ldots, F_n \mid L_i)}{P(F_1, \ldots, F_n)}$$


Because the denominator is constant regardless of $L_i$ we
ignore it for the remainder of the derivation. Leaving
out the denominator and repeatedly applying the rule of
conditional probability, we rewrite this as:


$$P(L_i \mid F_1, \ldots, F_n) = P(L_i) \prod_{k=1}^{n} P(F_k \mid F_1, \ldots, F_{k-1}, L_i)$$


Given that features are assumed to be conditionally independent,
we can simplify this to:


$$P(L_i \mid F_1, \ldots, F_n) = P(L_i) \prod_{k=1}^{n} P(F_k \mid L_i)$$


Classifying a fragment of JavaScript requires travers- 
ing its AST to extract the fragment’s features, multiply- 
ing the constituent probabilities of each discovered fea- 
ture (actually implemented by adding log-probabilities), 
and finally multiplying by the prior probability of the la- 
bel. It is clear from the definition that classification may 
be performed in linear time, parameterized by the size 
of the code fragment’s AST, the number of features be- 
ing examined, and the number of possible labels. The 
processes of collecting and hand-categorizing JavaScript 
samples and training the ZOZZLE classifier are detailed in
Section 4. 
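
The classification rule above can be sketched in Python using log-probabilities, as the text suggests. This is a minimal illustration, not the ZOZZLE implementation: the function names, the string encoding of features, and the add-one smoothing used to avoid taking the logarithm of zero are assumptions made for the example.

```python
import math


def train(samples, feature_names):
    """samples: list of (label, set_of_features). Returns log-probability tables."""
    labels = sorted({label for label, _ in samples})
    counts = {label: 0 for label in labels}
    feat_counts = {label: {f: 0 for f in feature_names} for label in labels}
    for label, feats in samples:
        counts[label] += 1
        for f in feats:
            if f in feat_counts[label]:
                feat_counts[label][f] += 1
    total = len(samples)
    log_prior = {label: math.log(counts[label] / total) for label in labels}
    # Add-one smoothing (an assumption of this sketch) keeps every
    # estimated P(F_k | L_i) strictly positive.
    log_like = {
        label: {
            f: math.log((feat_counts[label][f] + 1) / (counts[label] + 2))
            for f in feature_names
        }
        for label in labels
    }
    return log_prior, log_like


def classify(features, log_prior, log_like):
    """Return the label maximizing log P(L) + sum of log P(F_k | L)."""
    scores = {}
    for label in log_prior:
        score = log_prior[label]
        for f in features:
            if f in log_like[label]:
                score += log_like[label][f]
        scores[label] = score
    return max(scores, key=scores.get)


names = ["loop:spray", "try:unescape", "function:$(this)"]
samples = [
    ("malicious", {"loop:spray", "try:unescape"}),
    ("malicious", {"loop:spray"}),
    ("benign", {"function:$(this)"}),
    ("benign", set()),
]
log_prior, log_like = train(samples, names)
```

Summing logarithms instead of multiplying probabilities avoids floating-point underflow when a fragment contains many features, and the cost of classify grows linearly with the number of features and labels.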


3.6 Fast Pattern Matching 


An AST node contains a feature if the feature’s text is a 
substring of the AST node. With a naive approach, each 
feature must be matched independently against the node 
text. To improve performance, we construct a state ma- 
chine for each context that reduces the number of charac- 
ter comparisons required. There is a state for each unique 
character occurring at each position in the features for a 
given context. 

Pseudocode for the fast matching algorithm is
shown in Figure 7. State transitions are selected based
on the next character in the node text. Every state has a
bit mask with bits corresponding to features. The bits are
set only for those features that have the state’s incoming
character at that position. At the beginning of the
matching, a bit mask is set to all ones. This mask is ANDed
with the mask at each state visited during matching.
At the end of matching, the bit mask contains the set of
features present in the node. This process is repeated 
for each position in the node’s text, as features need not 
match at the start of the node. 


Example 2 An example of a state machine used for fast
pattern matching is shown in Figure 6. This string matching
state machine can identify three patterns: alert,
append, and insert. Assume the matcher is running on
input text appert. During execution, a bit array of size
three, called the match list, is kept to indicate the patterns
that have been matched up to this point in the input.
This bit array starts with all bits set. From the leftmost
state we follow the edge labeled with the input’s
first character, in this case an a.


The match list is bitwise-anded with this new state’s
bit mask of 110. This process is repeated for the input
characters p, p, e. At this point, the match list contains 010
and the remaining input characters are r, t, and null (also
notated as \0). Even though a path to an end state exists
with edges for the remaining input characters, no patterns 
will be matched. The next character consumed, an r, 
takes the matcher to a state with mask 001 and match 
list of 010. Once the match list is masked for this state, 
no patterns can possibly be matched. For efficiency, the 
matcher terminates at this point and returns the empty 
match list. 
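
The matcher in this example can be approximated with per-position character masks. The Python sketch below is an illustration under stated assumptions, not the authors' state machine: it indexes states by (position, character), records a match when the last character of a feature is consumed instead of using a null terminator, and assigns bits in list order, so the bit patterns are the reverse of those printed above. All names are hypothetical, and features are assumed to be non-empty strings.

```python
def build_matcher(features):
    """Build per-position masks for a list of non-empty feature strings."""
    max_len = max(len(f) for f in features)
    masks = [{} for _ in range(max_len)]  # masks[pos][char] -> features with char at pos
    ends = [0] * max_len                  # ends[pos] -> features whose last char is at pos
    for bit, feat in enumerate(features):
        for pos, ch in enumerate(feat):
            masks[pos][ch] = masks[pos].get(ch, 0) | (1 << bit)
        ends[len(feat) - 1] |= 1 << bit
    return masks, ends


def match_from(masks, ends, text, start, all_ones):
    """Return the bitmask of features that occur in text starting at `start`."""
    live = all_ones   # features still consistent with the characters seen so far
    found = 0
    for pos in range(len(masks)):
        if start + pos >= len(text):
            break
        live &= masks[pos].get(text[start + pos], 0)
        if live == 0:
            break     # no feature can match any more: stop early
        found |= live & ends[pos]
    return found


def find_features(features, text):
    """Return the set of features that appear as substrings of text."""
    masks, ends = build_matcher(features)
    all_ones = (1 << len(features)) - 1
    found = 0
    for start in range(len(text)):  # features need not match at the start
        found |= match_from(masks, ends, text, start, all_ones)
    return {f for bit, f in enumerate(features) if (found >> bit) & 1}


patterns = ["alert", "append", "insert"]
```

On the input appert the sketch follows the same steps as the example: after a, p, p, e only append is still live, and the r at the fifth position clears it, so nothing is reported.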


The maximum number of comparisons required to 
match an arbitrary input with this matcher is 17, ver- 
sus 20 for naive matching (including null characters at 
the ends of strings). The worst-case number of compar- 
isons performed by the matcher is the total number of 
distinct edge inputs at each input position. The sample 
matcher has 19 edges, but at input position 3 two edges 
consume the same character (’e’), and at input position 6 
two edges consume the null character. In practice, we
find that the number of comparisons is reduced significantly
more than for this sample: with a large number of
features, the pigeonhole principle guarantees that many
features share the same character at a given position.














For a classifier using 100 features, a single position in 
the input text would require 100 character comparisons 
with naive matching. Using the state machine approach, 
there can be no more than 52 comparisons at each string 
position (36 alphanumeric characters and 16 punctuation 
symbols), giving a reduction of nearly 50%. In practice 
there are even more features, and input positions do not 
require matching against every possible input character. 


Figure 8 clearly shows the benefit of fast pattern
matching over a naive matching algorithm. The graph
shows the average number of character comparisons performed
per-feature using both our scheme and a naive
approach that searches an AST node’s text for each pattern
individually. As can be seen from the figure, the
fast matching approach has far fewer comparisons, decreasing
asymptotically as the number of features approaches 1,500.


[Figure 6 diagram not recoverable from the source.]

Figure 6: Fast feature matching illustrated.


matchList ← (1, 1, ..., 1)
state ← 0
for all c in input do
    state ← matcher.getNextState(state, c)
    matchList ← matchList ∧ matcher.getMask(state)
    if matchList = (0, 0, ..., 0) then
        return matchList
    end if
end for
return matchList


Figure 7: Fast matching algorithm.


[Figure 8 plot. Series: Naive Matching, Fast Matching. x axis: Features.]

Figure 8: Comparisons required per-feature with naive vs. fast pattern
matching. The number of features is shown on the x axis.


3.7 Future Improvements 


In this section, we describe additional algorithmic im- 
provements not present in our initial implementation. 


3.7.1 Automatic Malware Clustering 


Using the same features extracted for classification, it 
is possible to automatically cluster attacks into groups. 
There are two possible approaches that exist in this 
space: supervised and unsupervised clustering. 
Supervised clustering would consist of hand- 
categorizing attacks, which has actually already been 
done for about 1,000 malicious contexts, and assigning 
new scripts to one of these groups. Unsupervised 
clustering would not require the initial sorting effort, 
and is more likely to successfully identify new, common 
attacks. It is likely that feature selection would be an 
ongoing process; selected features should discriminate
between different clusters, and these clusters will likely 
change over time. 


3.7.2 Substring Feature Selection 


For the current version of ZOZZLE, automatic feature se- 
lection only considers the entire text of an AST node as 
a potential feature. While simply taking all possible sub- 
strings of this and treating those as possible features as 
well may seem reasonable, the end result is a classifier 
with many more features and little (if any) improvement 
in classification accuracy. 


An alternative approach would be to treat certain types 
of AST nodes as “divisible” when collecting candidate 
features. If the entire node text is not a good discrimi- 
native feature, its component substrings can be selected 
as candidate features. This avoids introducing substring 
features when the full text is sufficiently informative, but 
allows for simple patterns to be extracted from longer 
text (such as %u or %u0c0c) when they are more informa- 
tive than the full string. Not all AST nodes are suitable 
for subdivision, however. Fragments of identifiers don’t 
necessarily make sense, but string constants and numbers 
could still be meaningful when split apart. 


3.7.3 Feature Flow 


At the moment, features are extracted only from the text 
of the AST nodes in a given context. This works well for 
whole-script classification, but has yielded more limited 
results for fine-grained classification (that is, to identify 
that a specific part of the script is malicious). To prevent 
a particular feature from appearing in a particularly infor- 
mative context (such as COMMENT appearing inside a loop, a 
component of the Aurora exploit [19]), an attacker can simply
assign this string to a variable outside the loop and
reference the variable within the loop. The idea behind 
feature flow is to keep a simple lookup table for iden- 
tifiers, where both the identifier name and its value are 
used to extract features from an AST node. 


By ignoring scoping rules and loops, we can get a rea- 
sonable approximation of the features present in both the 
identifiers and values within a given context with low 
overhead. This could be taken one step further by em- 
ulating simple operations on values. For example, if two 
identifiers set to strings are added, the values of these 
strings could be concatenated and then searched for fea- 
tures. This would prevent attackers from hiding common 
shellcode patterns using concatenation. 
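
The feature flow idea can be illustrated with a small Python sketch. Since the text describes it as a future improvement, everything here is an assumption made for illustration: the tuple-based expression encoding, the flat lookup table that ignores scoping, and the names resolve and flow_features.

```python
def resolve(expr, table):
    """Approximate the string value of a toy expression, or None if unknown.

    Expression forms (illustrative assumptions):
      ("str", value)       string literal
      ("id", name)         identifier looked up in the table
      ("add", left, right) string concatenation
    """
    kind = expr[0]
    if kind == "str":
        return expr[1]
    if kind == "id":
        return table.get(expr[1])
    if kind == "add":
        left = resolve(expr[1], table)
        right = resolve(expr[2], table)
        if left is None or right is None:
            return None
        return left + right
    return None


def flow_features(assignments, patterns):
    """Walk (name, expr) assignments in order, ignoring scopes and loops."""
    table = {}
    hits = set()
    for name, expr in assignments:
        value = resolve(expr, table)
        if value is not None:
            table[name] = value
            hits.update(p for p in patterns if p in value)
    return hits


assignments = [
    ("a", ("str", "%u0c")),
    ("b", ("str", "0c%u0c0c")),
    ("c", ("add", ("id", "a"), ("id", "b"))),
]
```

Neither literal contains the pattern %u0c0c%u0c0c on its own, but the emulated concatenation assigned to c does, which is the kind of hiding the text says this approach would defeat.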
4 Experimental Methodology


In order to train and evaluate ZOZZLE, we created a col- 
lection of malicious and benign JavaScript samples to use 
as training data and for evaluation. 


Gathering Malicious Samples: To gather the results 
for Section 5, we first dynamically scanned URLs with 
a browser running both NOZZLE and the ZOZZLE JavaScript
deobfuscator. In this configuration, when NOZZLE
detects a heap spraying exploit, we record the URL and 
save to disk all JavaScript contexts seen by the deobfus- 
cator. All recorded JavaScript contexts are then hand- 
examined to identify those that contain any malware ele- 
ments (shellcode, vulnerability, or heap-spray). 

Malicious contexts can be sorted efficiently by first 
grouping by their md5 hash value. This dramatically re- 
duces the required effort because of the lack of exploit 
diversity explained first in Section 2 and relatively few 
identifier-renaming schemes being employed by attack- 
ers. For exploits that do appear with identifier names 
changed, there are still usually some identifiers left unchanged
(often part of the standard JavaScript API)
which can be identified using the grep utility. Finally, 
hand-examination is used to handle the few remaining 
unsorted exploits. Using a combination of these tech- 
niques, 919 deobfuscated malicious contexts were iden- 
tified and sorted in several hours. 
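
The first sorting step, grouping contexts by hash, can be sketched with Python's standard hashlib module. The function name and the toy inputs are hypothetical; the sketch only shows why identical contexts collapse into a single group that needs to be examined once.

```python
import hashlib
from collections import defaultdict


def group_by_md5(contexts):
    """Map each MD5 digest to the indices of the contexts that share it."""
    groups = defaultdict(list)
    for index, source in enumerate(contexts):
        digest = hashlib.md5(source.encode("utf-8")).hexdigest()
        groups[digest].append(index)
    return dict(groups)


# Toy stand-ins for saved JavaScript contexts.
contexts = ["var a=1;", "var b=2;", "var a=1;"]
groups = group_by_md5(contexts)
```

Contexts that differ only in identifier names hash differently, which is why the text falls back on grep and hand-examination for the remainder.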


Gathering Benign Samples: To create a set of benign 
JavaScript contexts, we extracted JavaScript from the 
Alexa.com top 50 URLs using the ZOZZLE deobfuscator.
The 7,976 contexts gathered from these sites were used 
as our benign dataset. 


Feature Selection: To evaluate ZOZZLE, we partition our
malicious and benign datasets into training and evaluation
data and train a classifier. We then apply this classifier
to the withheld samples and compute the false positive
and negative rates. To train a classifier with ZOZZLE,
we first need to define a set of features from the code.
These features can be hand-picked, or automatically se- 
lected (as described in Section 3) using the training ex- 
amples. In our evaluation, we compare the performance 
of classifiers built using hand-picked and automatically 
selected features. 
The 89 hand-picked features were selected based on experience
and intuition with many pieces of malware detected by NOZZLE
and involved collecting particularly “memorable” features
frequently repeated in malware samples.


Feature

try : unescape
loop : spray
loop : payload
function : addbehavior
string : 0c


Figure 9: Examples of hand-picked features
used in our experiments.


Feature                                       Present   M : B

function : anonymous                          ✓         1 : 4609
try : newactivexobject("pdf.pdfctrl")         ✓         1309 : 1
loop : scode                                  ✓         1211 : 1
function : $(this)                            ✓         1 : 1111
if : "shel"+"l.ap"+"pl"+"icati"+"on"          ✓         997 : 1
string : %u0c0c%u0c0c                         ✓         993 : 1
loop : shellcode                              ✓         895 : 1
function : collectgarbage()                   ✓         175 : 1
string : #default#userdata                    ✓         10 : 1
string : %u                                   ✗         1 : 6


Figure 10: Sample of automatically selected features and their
discriminating power as a ratio of likelihood to appear in a malicious or
benign context.

Automatically selecting features typically yields many 
more features as well as some features that are biased 
toward benign JavaScript code, unlike hand-picked fea- 
tures that are all characteristic of malicious JavaScript 
code. Examples of some of the hand-picked features 
used are presented in Figure 9. 

For comparison purposes, samples of the automatically
extracted features, including a measure of their discriminating
power, are shown in Figure 10. The middle
column shows whether it is the presence of the feature
(✓) or the absence of it (✗) that we are matching on.
The last column shows the number of malicious (M) and
benign (B) contexts in which the feature appears in our training set.


In addition to the feature selection methods, we also 
varied the types of features used by the classifier. Be- 
cause each token in the Abstract Syntax Tree (AST) ex- 
ists in the context of a tree, we can include varying parts 
of that AST context as part of the feature. Flat features 
are simply text from the JavaScript code that is matched 
without any associated AST context. We should empha- 
size that flat features are what are typically used in var- 
ious text classification schemes. What distinguishes our 
work is that, through the use of hierarchical features, we 
are taking advantage of the contextual information given 
by the code structure to get better precision. 

Hierarchical features, either 1- or n-level, contain a 
certain amount of AST context information. For exam- 
ple, 1-level features record whether they appear within 
a loop, function, conditional, try/catch block, etc. Intu- 
itively, a variable called shellcode declared or used right 
after the beginning of a function is perhaps less indicative
of malicious intent than a variable called shellcode
that is used within a loop, as is common in the case of a
spray. For n-level features, we record the entire stack of 
AST contexts such as 


(a loop, within a conditional, within a function, . . .) 
Features   Hand-Picked   Automatic   # Features
flat       95.45%        99.48%      948
1-level    98.51%        99.20%      1,589
n-level    96.65%        99.01%      2,187


Figure 11: Classifier accuracy for hand-picked and automatically
selected features.


           Hand-Picked               Automatic
Features   False Pos.   False Neg.   False Pos.   False Neg.
flat       4.56%        4.51%        0.01%        5.84%
1-level    1.52%        1.26%        0.00%        9.20%
n-level    3.18%        5.14%        0.02%        11.08%


Figure 12: False positives and false negatives for flat and hierarchical
features using hand-picked and automatically selected features.


The depth of the AST context presents a tradeoff between 
accuracy and performance, as well as between false pos- 
itives and false negatives. We explore these tradeoffs in 
detail in Section 5. 
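
The three feature types can be illustrated with a small sketch. It assumes a toy AST encoded as nested `(kind, text, children)` tuples; this encoding, the `CONTEXT_KINDS` set, and the `extract_features` helper are all hypothetical and are not the data structures used by ZOZZLE.

```python
# Node kinds treated as AST context. This set is an assumption; the text names
# loops, functions, conditionals, and try/catch blocks among them.
CONTEXT_KINDS = {"function", "loop", "if", "try"}

def extract_features(node, stack=()):
    """Yield (flat, one_level, n_level) feature strings for every leaf token."""
    kind, text, children = node
    if kind in CONTEXT_KINDS:
        stack = stack + (kind,)
    if not children and kind not in CONTEXT_KINDS:
        flat = text                                            # no context at all
        one_level = f"{stack[-1]} : {text}" if stack else text  # innermost context only
        n_level = " > ".join(stack + (text,))                  # full stack, outermost first
        yield flat, one_level, n_level
    for child in children:
        yield from extract_features(child, stack)

# Hypothetical tree for: function f() { if (x) { for (...) { shellcode } } }
tree = ("function", "f", [
    ("if", "", [
        ("loop", "", [
            ("ident", "shellcode", []),
        ]),
    ]),
])
features = list(extract_features(tree))
```

For this tree the single leaf produces the flat feature `shellcode`, the 1-level feature `loop : shellcode`, and an n-level feature that records the whole loop-within-conditional-within-function stack.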


5 Evaluation 


In this section, we evaluate the effectiveness of ZOZZLE 
using the benign and malicious JavaScript samples de- 
scribed in Section 4. To obtain the experimental results 
presented in this section, we used an HP xw4600 workstation
(Intel Core2 Duo E8500 3.16 GHz, dual processor,
4 GB of memory), running Windows 7 64-bit
Enterprise. 


5.1 False Positives and False Negatives 


Accuracy: Figure 11 shows the overall classification ac- 
curacy of ZOZZLE when evaluated using our malicious 
and benign JavaScript samples. The accuracy is mea- 
sured as the number of successful classifications divided 
by total number of samples. In this case, because we have 
many more benign samples than malicious samples, the 
overall accuracy is heavily weighted by the effectiveness 
of correctly classifying benign samples. 
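
A short calculation shows how strongly the class imbalance weights this metric. The sketch below uses the dataset sizes from Section 4 (919 malicious and 7,976 benign contexts); the degenerate "always benign" classifier is a hypothetical baseline introduced only for illustration.

```python
def accuracy(tp, tn, fp, fn):
    """Successful classifications divided by the total number of samples."""
    return (tp + tn) / (tp + tn + fp + fn)

MALICIOUS, BENIGN = 919, 7976  # dataset sizes reported in Section 4

# Hypothetical baseline: a classifier that labels every context benign.
acc_all_benign = accuracy(tp=0, tn=BENIGN, fp=0, fn=MALICIOUS)
```

Under these assumptions the baseline already reaches roughly 89.7% accuracy without detecting any malware, which is why the false positive and false negative rates are reported separately.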

In the figure, the results are sub-divided first by 
whether the features are selected by hand or using the au- 
tomatic technique described in Section 3, and then sub- 
divided by the amount of context used in the classifier 
(flat, 1-level, and n-level). 


1Unless otherwise stated, for these results 25% of the samples
were used for classifier training and the remaining files were used 
for testing. Each experiment was repeated five times on a different 
randomly-selected 25% of hand-sorted data. 
             ZOZZLE      AV1      AV2      AV3      AV4      AV5
Samples      1,275,033   1,275,078 (each anti-virus engine)
True pos.    5           3        0        3        1        3
False pos.   4           2        5        5        4        3
FP rate      3.1E-6      1.6E-6   3.9E-6   2.9E-6   3.1E-6   2.4E-6


Figure 13: False positive rate comparison.


Figure 11 shows that, overall, automatic feature selection
significantly outperforms hand-picked feature selection,
with an overall accuracy above 99%. Second, we
see that while some context helps the accuracy of the 
hand-picked features, overall, context has little impact on 
the accuracy of automatically selected features. We also 
see in the fourth column the number of features that were 
selected in the automatic feature selection. As expected, 
the number of features selected with the n-level classifier 
is significantly larger than the other approaches. 


Hand-picked vs. Automatic: Figure 12 expands on the 
above results by showing the false positive and false neg- 
ative rates for the different feature selection methods and 
levels of context. The false positive rate is computed as a fraction of
benign samples and the false negative rate as a fraction of
malicious samples. We see from
the figure that the false positive rate for all configurations 
of the hand-picked features is relatively high (1.5-4.5%), 
whereas the false positive rate for the automatically se- 
lected features is nearly zero. The best case, using au- 
tomatic feature selection and 1-level of context, has no 
false positives in any of the randomly-selected training 
and evaluation subsets. The false negative rate for all 
the configurations is relatively high, ranging from 1-11%
overall. While this suggests that some malicious contexts 
are not being classified correctly, for most purposes, hav- 
ing high overall accuracy and low false positive rate are 
the most important attributes of a malware classifier. 
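
The two error rates can be computed from a confusion matrix as sketched below. The function name and the example counts are hypothetical; the denominators follow the standard definitions, with false positives measured against benign samples and false negatives against malicious samples.

```python
def error_rates(tp, tn, fp, fn):
    """Return (false positive rate, false negative rate)."""
    fpr = fp / (fp + tn)  # fraction of benign samples flagged as malicious
    fnr = fn / (fn + tp)  # fraction of malicious samples that were missed
    return fpr, fnr

# Hypothetical counts: 100 malicious and 1,000 benign evaluation samples.
fpr, fnr = error_rates(tp=90, tn=1000, fp=0, fn=10)
```

With these made-up counts the classifier has no false positives and misses one malicious sample in ten, which is the kind of operating point the text describes as acceptable for a malware classifier.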


Best classifier: In contrast to the lower false positive 
rates, the false negative rates of the automatically selected
features are higher than they are for the hand-picked
features. Our insight is that automatic
feature selection selects many more features, which lowers
the false positive rate but at the same time raises
the false negative rate, because extra benign features can
sometimes mask malicious intent. We see that trend manifest itself among the
alternative amounts of context in the automatically se- 
lected features. The n-level classifier has more features 
and a higher false negative rate than the flat or 1-level 
classifiers. Since we want to achieve a very low false 
positive rate with a moderate false negative rate, and the 
1-level classifier provided the best false positive rate in 
these experiments, in the remainder of this section, we 
consider the effectiveness of the 1-level classifier in more 
detail. 
          ZOZZLE   JSAND   AV1   AV2   AV3   AV4   AV5
FN rate   9%       15%     24%   28%   34%   83%   42%


Figure 14: False negative rate comparison.


5.2 Comparison with AV & Other Techniques


Previous analysis has been performed on a relatively 
small set of benign files. As a result, our 1-level clas- 
sifier does not produce any false alarms on about 8,000 
benign samples, but using a set of this size limits the pre- 
cision of our evaluation. To fully understand the false 
positive rate of ZOZZLE, we have obtained a large collec- 
tion of over 1.2 million benign JavaScript contexts taken 
from manually white-listed web sites. 


Investigating false positives further: Figure 13 shows 
the results of running both ZOZZLE and five state-of-the-art
anti-virus products on the large benign data set. Out
of the 1.2 million files, only 4 were incorrectly marked
malicious by ZOZZLE. This is fewer than one in a quarter
million false alarms. The four false positives flagged
by ZOZZLE fell into two distinct cases and both cases
were essentially a single large JSON-like data structure 
that included many instances of encoded binary strings. 
Adding a specialized JSON data recognizer to ZOZZLE 
could eliminate these false alarms. 

Even though anti-virus products attempt to be ex- 
tremely careful about false positives, in our run, the 
five anti-virus engines produced 29 alerts when applied 
to 1,275,078 JavaScript samples. 

Out of these, over half (19 alerts) turned out to be false
positives. We investigated these further and found several
reasons for these errors. The first is assuming that
a document.write of an unescaped string is malicious
when in fact it is not. The second reason
is flagging unpackers, i.e. pieces of code that convert 
a string into another one through character code trans- 
lation. Clearly, these unpackers alone are not malicious. 
We show examples of these mistakes in Appendix C. The
third reason is overly aggressively flagging phishing sites 
that insert links into the current page; this is because the 
anti-virus is unable to distinguish between them and mal- 
ware. The figure also shows cases where we found true 
malware in the large data set (listed as true positives), 
despite the fact that the web sites that the JavaScript was 
taken from were white-listed. We see that ZOZZLE was 
also better at finding true positives than the anti-virus de- 
tectors, finding a total of five out of the 1.2 million sam- 
ples. We also note that the number of samples used in the 
anti-virus and ZOZZLE results in this table are slightly dif- 
ferent due to the fact that on some of the samples either 
the anti-virus or ZOZZLE aborts due to ill-formed Java- 
Script syntax and those samples are not included in the 
total. 
In summary, ZOZZLE has a false positive rate
of 0.0003%, which is comparable to the five anti-virus 
tools in all cases and is better than some of them. 
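
The summary figure follows directly from the counts in Figure 13. The arithmetic is sketched below; only the variable names are introduced here.

```python
false_positives = 4
samples = 1_275_033  # benign contexts that ZOZZLE processed (Figure 13)

rate = false_positives / samples    # about 3.1E-6
percent = rate * 100                # about 0.0003%
one_in = samples / false_positives  # about one false alarm per 319,000 contexts
```

This is consistent with the "fewer than one in a quarter million" statement, since one false alarm per roughly 319,000 contexts is rarer than one per 250,000.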


Investigating false negatives further: Figure 14 shows 
a comparison of ZOZZLE and the five anti-virus engines 
discussed above. We fed the anti-virus engines the 919 
hand-labeled malware samples used in the previous evaluation
of ZOZZLE.² Additionally, we include JSAND [6],
a recently published malware detector that has a public 
web interface for malware upload and detection. In the 
case of JSAND, we only used a small random sample 
of 20 malicious files due to the difficulty of automat- 
ing the upload process, apparent rate limiting, and the 
latency of JSAND evaluation. The figure demonstrates 
that all of the other products have a higher false negative 
rate compared to ZOZZLE. JSAND is the closest, produc- 
ing a false negative rate of 15%. We feel that these high 
false negative rates for the anti-virus products are likely 
caused by the tendency of such products to be conserva- 
tive and trade low false positives for higher false nega- 
tives. This experiment illustrates the difficulty that tradi- 
tional anti-virus techniques have classifying JavaScript, 
where self-generated code is commonplace. We feel that 
ZOZZLE excels in both dimensions. 


5.3 Classifier Performance 


Figure 15 shows the classification time as a function of 
the size of the file, ranging up to 10 KB. We used automatic
feature selection and a 1-level classifier trained on 25%
of the hand-sorted dataset with no hard limit on feature 
counts to obtain this chart. This evaluation was per- 
formed on a classifier with over 4,000 features, and rep- 
resents the worst case performance for classification. We 
see that for a majority of files, classification can be per- 
formed in under 4 ms. Moreover, many contexts are in 
fact eval contexts, which are generally smaller than Java- 
Script files downloaded from the network. In the case of 
such eval contexts, the classification overhead is
usually 1 ms or less.

Figure 16 displays the overhead as a function of the 
number of classification features we used and compares 
it to the average parse time of 0.86 ms. Despite the fast
feature matching algorithm presented in Section 3, hav- 
ing more features to match against is still quite costly. As 
a result, we see the average classification time grow sig- 
nificantly, albeit linearly, from about 1.6 ms for 30 fea- 
tures to over 7 ms for about 1,300 features. While these 
numbers are from our unoptimized implementation, we 
believe that ZOZZLE’s static detector has a lot of potential 
for fast on-the-fly malware identification. 
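
The linear growth described above can be turned into a rough cost estimate. The sketch below fits a straight line through the two quoted endpoints, treating "about 1.6 ms for 30 features" and "over 7 ms for about 1,300 features" as exact points, which is an assumption; the resulting numbers are illustrative and are not measurements from the paper.

```python
# Endpoints quoted in the text, treated as exact for illustration only.
F_LOW, T_LOW = 30, 1.6      # features, milliseconds
F_HIGH, T_HIGH = 1300, 7.0

slope = (T_HIGH - T_LOW) / (F_HIGH - F_LOW)  # ms per additional feature
intercept = T_LOW - slope * F_LOW            # fixed per-context cost in ms

def estimated_time_ms(num_features):
    """Linear estimate of average classification time for a feature count."""
    return intercept + slope * num_features
```

Under this assumption each additional feature costs on the order of four microseconds, so trimming the feature set is a direct way to trade detection coverage for speed.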


2The ZOZZLE false negative rate listed in Figure 14 is taken from
our cross-validation experiment in Figure 12.
Figure 15: Classification time as a function of JavaScript file size. File size in bytes is shown on the x axis and the classification time in ms is
shown on the y axis.
Figure 16: Classifier throughput and accuracy as a function of the
number of features, using 1-level classification with 0.25 of the training
set size.


6 Discussion 


Caveats and limitations: All classifier-based malware 
detection tools will fail to detect some attacks, such as 
exploits that do not contain any of the features present 
in the training examples. More importantly, attackers 
who have a copy of ZOZZLE as an oracle can devise vari- 
ants of malware that are not detected by it. For example, 
they might rename variables, obscure strings by encod- 
ing them or breaking them into pieces, or substitute dif- 
ferent APIs that accomplish the same task. 
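
The string-splitting evasion can be illustrated with a toy matcher. The sketch below is not how ZOZZLE matches features; `flat_match` and `fold_concatenation` are hypothetical helpers, and the regular expression handles only adjacent double-quoted literals joined by `+`.

```python
import re

def flat_match(feature, source):
    """Return True if the feature text occurs verbatim in the source."""
    return feature in source

def fold_concatenation(source):
    """Join adjacent double-quoted literals connected by '+' (toy constant folding)."""
    return re.sub(r'"\s*\+\s*"', "", source)

plain = 'var o = new ActiveXObject("shell.application");'
split = 'var o = new ActiveXObject("shel" + "l.ap" + "pl" + "icati" + "on");'

hit_plain = flat_match("shell.application", plain)
hit_split = flat_match("shell.application", split)
hit_folded = flat_match("shell.application", fold_concatenation(split))
```

The verbatim match fails on the split form and succeeds once the literals are folded. A trained classifier can also adapt in the other direction: Figure 10 shows that automatic selection picked up the split string itself as a feature.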

Evasion is made somewhat more difficult because any 
exploit that uses a known CVE must eventually make 
the necessary JavaScript runtime calls (e.g., detecting 
or loading a plugin) to trigger the exploit. If ZOZZLE
is able to statically detect such calls, it will detect the 
attempted exploit. To avoid such detection, an attacker 
might change the context in which these calls appear by 
creating local variables that reference the desired run- 
time function, an approach already employed by some 
exploits we have collected. 

In the future, for ZOZZLE to continue to be effective, 
it has to be adaptive against attempts to avoid detection.
This adaptation takes two forms: improving its
ability to reason about the malware, and adapting the 
feature set used to detect malware as it evolves. To im- 
prove ZOZZLE’s detection capability, it needs to incorpo- 
rate more semantic information about the JavaScript it 
analyzes. For example, as described in Section 3, feature 
flow could help ZOZZLE identify attempts to obfuscate
the use of APIs necessary for malware to be successful.
Adapting ZOZZLE’s feature set requires continuous
retraining based on collecting malware samples detected 
by deploying other detectors such as NOZZLE. With such 
adaptation, ZOZZLE would dramatically reduce the effec- 
tiveness of the copy-and-pasted attacks that make up the 
majority of JavaScript malware today. In combination 
with complementary detection techniques, such as NOZ- 
ZLE, an updated feature set can be generated frequently 
with no human intervention. 


Just as with anti-virus, we believe that ZOZZLE is one 
of several measures that can be used as part of a defense- 
in-depth strategy. Moreover, our experience suggests that 
in many cases attackers are slow to adapt to the changing 
landscape. Despite the wide availability of obfuscation 
tools, in our NOZZLE detection experiments we still find 
many sites not using any form of obfuscation at all. We 
also see little diversity in the exploits collected. For ex- 
ample, the top five malicious scripts account for 75% of 
the malware detected. 


Deployment: The most attractive deployment strategy 
for ZOZZLE is in-browser deployment. ZOZZLE has been
designed to require only occasional offline re-training so 
that classifier updates can be shipped off to the browser 
every several days or weeks. Figure 17 shows a proposed 
workflow for ZOZZLE in-browser deployment. 


Figure 17: In-browser ZOZZLE deployment: workflow.


The code of the in-browser detector does not need to
change; only the list of features and weights needs to
be sent, similarly to updating signatures in an anti-virus
product. Note that our detector is designed in a way that
can be tightly integrated into the JavaScript parser, making
malware “scoring” part of the overall parsing process;
the only thing that needs to be maintained as the
parse tree (AST) is being constructed is the set of matching
features. This, we believe, will make the incremental
overhead of ZOZZLE processing even lower than it is now.
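
The update model described above can be sketched with a generic naive Bayes scorer in which the shipped data is just a feature-to-weight table. This is a simplified illustration, not ZOZZLE's classifier as defined in Section 3: it scores only features that are present, uses Laplace smoothing, and all counts and names below are hypothetical.

```python
import math

def weights_from_counts(counts, n_mal, n_ben):
    """Laplace-smoothed log ratio of P(feature | malicious) to P(feature | benign)."""
    return {
        f: math.log(((m + 1) / (n_mal + 2)) / ((b + 1) / (n_ben + 2)))
        for f, (m, b) in counts.items()
    }

def log_odds(present, weights, prior_log_odds=0.0):
    """Sum the weights of the features matched while parsing one context."""
    return prior_log_odds + sum(w for f, w in weights.items() if f in present)

# Hypothetical offline training output: (malicious, benign) contexts per feature.
counts = {"loop : shellcode": (90, 1), "function : $(this)": (1, 500)}
weights = weights_from_counts(counts, n_mal=100, n_ben=1000)

# In the browser, only the matched-feature set needs to be tracked per context.
score = log_odds({"loop : shellcode"}, weights)
is_malicious = score > 0
```

Because the scoring code depends only on the `weights` table, retraining offline and shipping a new table changes the detector's behavior without changing its code.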
Another way to deploy ZOZZLE is as a filter for a more
heavy-weight technique such as NOZZLE or some form 
of control- or dataflow integrity [1,5]. As such, the ex- 
pected end-user overhead will be very low, because both 
the detection rate of ZOZZLE and the rate of false posi- 
tives is very low; we assume if an attack is prevented, the 
user will not object to additional overhead in that case. 
Finally, ZOZZLE is suitable for offline scanning, either
in the case of dynamic web crawling using a web
browser, or in the context of purely static scanning that
exposes some part of the JavaScript code to the scanner. 


7 Related Work 


Several recent papers focus on static detection techniques
for malware, specifically malware implemented in JavaScript.
None of the existing techniques propose integrating
malware classification with JavaScript execution in
the context of a browser, as ZOZZLE does. 


7.1 Closely-related Malware Detection Work 


A quantitative comparison with closely related tech- 
niques is presented in Figure 18. It shows that ZOZZLE is 
heavily optimized for an extremely low rate of false positives
(about one in a quarter million), with the closest
second being CUJO [22] with six times as many false 
positives. 

Project     Citation   FP rate   FN rate   Static           Dynamic
ZOZZLE                 3.1E-6    9.2E-2    ✓ (footnote 3)
JSAND       [6]        1.3E-5    2E-3      ✓                ✓
Prophiler   [4]        9.8E-2    7.7E-3    ✓                ✓
CUJO        [22]       2.0E-5    5.6E-2    ✓                ✓


Figure 18: Quantitative comparison to closely related work.


ZOZZLE is generally faster than other tools, since the
only runtime activity it performs is capturing JavaScript
code. In its purely static mode, Cujo is also potentially
quite fast, with running times ranging from 0.01 to 10 ms
per URL; however, our rates are not directly comparable
because URLs and code contexts are not one-to-one.
Canali et al. [4] present Prophiler, a lightweight static 
filter for malware. It combines HTML-, JavaScript-, and 
URL-based features into one classifier that quickly fil- 
ters non-malicious pages so that malicious pages can be 
examined more extensively. While their approach has 
elements in common with ZOZZLE, there are also differences.
First, ZOZZLE focuses on classifying pages
based on unobfuscated JavaScript code by hooking into 
the JavaScript engine entry point, whereas Prophiler ex- 
tracts its features from the obfuscated code. Second, 
ZOZZLE automatically extracts hierarchical features from 
the AST, whereas Prophiler relies on a variety of sta- 
tistical and lexical hand-picked features present in the 
HTML and JavaScript. Third, the emphasis of ZOZZLE 
is on very low false positive rates, whereas Prophiler, be- 
cause it is intended as a fast filter, allows higher false 
positive rates in order to reduce the false negative rate. 
Rieck et al. [22] describe Cujo, a system that combines
static and dynamic features in a classifier framework
based on support vector machines. They preprocess
the source code into tokens and pass groups of
tokens (Q-grams) to automatically extract Q-grams that
are predictive of malicious intent. Unlike ZOZZLE, Cujo
is proxy-based and uses JavaScript emulation instead of 
hooking into the JavaScript runtime in a browser. This 
emulation adds runtime overhead, but allows Cujo to use 
static as well as dynamic Q-grams in their classification. 
ZOZZLE differs from Cujo in that it uses the existing JavaScript
runtime engine to unfold JavaScript contexts without
requiring emulation, which reduces the overhead.
Similarly, Cova et al. present JSAND, a system that
conducts classification based on static and dynamic features [6].
In JSAND, potentially malicious JavaScript is
emulated to determine runtime characteristics around de- 
obfuscation, environment preparation, and exploitation, 
such as the number of bytes allocated through string op- 
erations. These features are trained and evaluated with 
known good and bad URLs. Like Cujo, JSAND uses emulation
to combine a collection of static and dynamic
features in their classification, as compared to ZOZZLE, 


3The only part of ZOZZLE that requires dynamic intervention is 
unfolding. 
which extracts only static features automatically. Also,
because ZOZZLE leverages the existing JavaScript engine
unfolding process rather than emulation, JSAND is significantly
slower than ZOZZLE.


7.2 Other Projects 


Karanth et al. identify malicious JavaScript using a 
classifier based on hand-picked features present in the 
code [14]. Like us, they use known malicious and be- 
nign JavaScript files and train a classifier based on fea- 
tures present. They show that their technique can detect 
malicious JavaScript with high accuracy and they were 
able to detect a previously unknown zero-day vulnerabil- 
ity. Unlike our work, they do not integrate their classifier 
into the JavaScript engine, and so do not see the unfolded 
JavaScript as we do. 

High-interaction client honeypots have been at the 
forefront of research on drive-by-download attacks. 
Since they were first introduced in 2005, various stud- 
ies have been published [15,20,25,30-32]. High- 
interaction client honeypots drive a vulnerable browser 
to interact with potentially malicious web pages and monitor
the system for unauthorized state changes, such as
new processes being created. The detection of drive-by- 
download attacks can also occur through the analysis of 
the content retrieved from the web server. When cap- 
tured at the network layer or through a static crawler, 
the content of malicious web pages is usually highly 
obfuscated, opening the door to static-feature-based
exploit detection [10, 20, 24, 27, 28]. While these approaches,
among others, consider static JavaScript features,
ZOZZLE is the first to utilize hierarchical features
extracted from ASTs. 

Besides static features focusing on HTML and Java- 
Script, shellcode injection exploits also offer points for 
detection. Existing techniques such as Snort [23] use 
pattern matching to identify attacks in a database. Polymorphic
attacks that vary shellcode on each exploit attempt
can avoid pattern-based detection unless improbable
properties of shellcode are used to detect such attacks,
as in Polygraph [17]. Like ZOZZLE, Polygraph utilizes
a naive Bayes classifier, but only applies it to the
detection of shellcode. 

Abstract Payload Execution (APE) by Toth and 
Kruegel [29], STRIDE by Akritidis et al. [2, 18], and 
NOZZLE by Ratanaworabhan, Livshits and Zorn [21] all 
focus on analysis of the shellcode and NOP sled used by 
a heap spraying attack. Such techniques can detect heap 
sprays with low false positive rates, but incur higher runtime
overhead than is acceptable for always-on deployment
in a browser (10-15% is fairly common).

Dynamic features have been the focus of several 
groups. Nazario, Buescher, and Song propose systems 
that detect attacks on scriptable ActiveX components [3,
16, 26]. They capture JavaScript interactions and use
vulnerability-specific signatures to detect attacks. This
method is effective in detecting attacks due to the relatively
homogeneous character of the attack landscape.
However, while these systems are effective in detecting known
existing attacks on ActiveX components, they fail to identify
attacks that do not involve ActiveX components,
which ZOZZLE is able to detect.


8 Conclusions 


This paper presents ZOZZLE, a highly precise, mostly 
static detector for malware written in JavaScript. ZOZZLE 
is a versatile technology that is suitable for deployment 
in a commercial browser, staged with a more costly runtime
detector like NOZZLE. Designing an effective in-browser
malware detector requires overcoming technical
challenges that include achieving high performance,
generating precise results, and overcoming attempts at 
obfuscating attacks. Much of the novelty of ZOZZLE 
comes from its hooking into the JavaScript engine
of a browser to get the final, expanded version of Java- 
Script code to address the issue of deobfuscation. Com- 
pared to other classifier-based tools, ZOZZLE uses contex- 
tual information available in the program abstract syntax 
tree (AST) to perform fast, scalable, yet precise malware 
detection. 

This paper contains an extensive evaluation of our 
techniques. We evaluated ZOZZLE in terms of perfor- 
mance and malware detection rates (both false posi- 
tives and false negatives) using over 1.2 million pre- 
categorized code samples. ZOZZLE has an extremely low 
false positive rate of 0.0003%, which is less than one in 
a quarter million. Despite this high accuracy, the ZOZZLE 
classifier is fast, with a throughput of over one megabyte
of JavaScript code per second. 
Shellcode obfuscation strategy   Spray   CVE
unescape                         ✓       2009-0075
unescape                         ✓       2009-1136
unescape                         ✓       2010-0806
unescape                         ✓       2010-0806
none                             ✗       2010-0806
hex, unescape                    ✓       none
replace, unescape                ✗       none
unescape                         ✓       2009-1136
replace, hex, unescape           ✓       2010-0249
custom, unescape                 ✓       2010-0806
unescape                         ✓       none
replace, array                   ✓       2010-0249
unescape                         ✓       none
unescape                         ✓       2009-1136
replace, unescape                ✗       none
replace, unescape                ✓       none
unescape                         ✓       2010-0249
unescape                         ✓       2010-0806
hex, unescape                    ✓       2008-0015
unescape                         ✗       none
replace, unescape                ✗       none
unescape, array                  ✓       2010-0249
replace, unescape                ✓       2010-0806
replace, unescape                ✓       2010-0806
replace, unescape                ✓       none
replace, unescape                ✗       none


Figure 19: Malware samples dissected and categorized.
Figure 20: Classification accuracy as a function of training set size for
hand-picked and automatically selected features. 


A Hand-Analyzed Samples 


In the process of training the ZOZZLE classifier, we hand- 
analyzed a number of malware samples. While there is a 
great deal of duplication, there is a diversity of malware 
writing strategies found in the wild. 

Figure 19 provides additional details about each 
unique hand-analyzed sample. Common Vulnerabilities 
and Exposures (CVEs) are assigned when new vulnerabilities
are discovered and verified, and these identifiers
are listed for all the exploits in Figure 19 that target some 
vulnerability. Shellcode and nopsled type describe the 
method by which JavaScript or HTML values are con- 
verted to the binary data that is sprayed throughout the 
heap. Most shellcode and nopsleds are written as hex- 
adecimal literals using the \x escape sequence. These 
cases are denoted by “hex” in Figure 19. 

Many scripts use the %u encoding and are converted 
to binary data with the JavaScript unescape function. Fi- 
nally, some samples include short fragments inserted re- 
peatedly (such as the string cuTE, which appears in sev- 
eral examples) that are removed or replaced by a call to 
the JavaScript replace function. 
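
The replace-then-unescape pattern can be sketched as follows. The helper `unescape_percent_u` is a hypothetical stand-in that handles only `%uXXXX` escapes and assumes a little-endian in-memory layout for each 16-bit unit; the obfuscated literal is made up, reusing the cuTE fragment mentioned above.

```python
import re

def unescape_percent_u(text):
    """Decode %uXXXX escapes into 16-bit little-endian byte pairs (other text is ignored)."""
    out = bytearray()
    for hex4 in re.findall(r"%u([0-9a-fA-F]{4})", text):
        out += int(hex4, 16).to_bytes(2, "little")
    return bytes(out)

# Hypothetical obfuscated literal with a junk fragment inserted repeatedly.
obfuscated = "%ucuTE0c0c%ucuTE0c0c"

cleaned = obfuscated.replace("cuTE", "")  # mirrors the JavaScript replace call
payload = unescape_percent_u(cleaned)     # binary data that would be sprayed
```

Until the junk fragment is removed, the literal contains no valid `%u` escape, which is what hides it from a scanner that looks at the page source rather than the unfolded code.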

In a few cases, the exploit sample does not contain 
one or more of the components of a heap spray attack 
(shellcode, spray, and vulnerability). In these cases, the 
script is delivered with one or more of the other samples 
for which it may provide shellcode, perform a spray, or 
trigger a vulnerability. 


B Additional Experimental Data


Training set size: To understand the impact of training 
set size on accuracy and false positive/negative rates, we 
trained classifiers using between 1% and 25% of our be- 
nign and malicious datasets. For each training set size, 
ten classifiers were trained using different randomly se- 
lected subsets of the dataset for both hand-picked and au- 
tomatic features. These classifiers were evaluated with 
respect to overall accuracy in Figure 20 and false posi- 
tives/negatives in Figure 21a. 
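
The sampling procedure can be sketched as below. The helper name, the fixed seed, and the integer stand-ins for samples are assumptions made for illustration; training and evaluating a classifier on each subset is omitted.

```python
import random

def sample_training_sets(dataset, fraction, repeats, seed=0):
    """Draw `repeats` independent random subsets, each holding `fraction` of the data."""
    rng = random.Random(seed)
    k = max(1, round(len(dataset) * fraction))
    return [rng.sample(dataset, k) for _ in range(repeats)]

malicious = list(range(919))  # stand-in IDs for the 919 malicious contexts

# Ten random 5% training sets, one per classifier to be trained.
subsets = sample_training_sets(malicious, fraction=0.05, repeats=10)
```

Averaging the error rates over the ten classifiers at each fraction smooths out the variation caused by any single random draw.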

The figures show that training set size does have an 
impact on the overall accuracy and error rates, but that 
a relatively small training set (< 5% of the overall data
set) is sufficient to realize most of the benefit. The false
negative rate using automatic feature selection
benefits the most from additional training data, which is 
explained by the fact that this classifier has many more 
features and benefits from more examples to fully train. 


Feature set size: To understand the impact of feature set 
size on classifier effectiveness, we trained the 1-level au- 
tomatic classifier, sorted the selected features by their 7 
value, and picked only the top NV features. For this ex- 
periment (due to the fact that the training set used is ran- 
domly selected), there were a total of 1,364 features orig- 
inally selected during automatic selection. 
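
Ranking and truncating the feature set can be sketched as follows, assuming the ranking value is a chi-squared statistic. The code uses the standard Pearson statistic for a 2x2 contingency table, which may differ in detail from the statistic defined in Section 3; the feature counts are hypothetical.

```python
def chi_squared(a, b, c, d):
    """Pearson chi-squared for a 2x2 table.

    a: malicious contexts with the feature    b: benign contexts with it
    c: malicious contexts without it          d: benign contexts without it
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

def top_n_features(counts, n_mal, n_ben, n):
    """Return the n features with the highest chi-squared value."""
    scored = {
        f: chi_squared(m, b, n_mal - m, n_ben - b)
        for f, (m, b) in counts.items()
    }
    return sorted(scored, key=scored.get, reverse=True)[:n]

# Hypothetical (malicious, benign) counts out of 100 and 1,000 contexts.
counts = {
    "loop : shellcode": (90, 1),
    "string : foo": (50, 500),       # equally common in both classes
    "function : $(this)": (1, 400),  # biased toward benign code
}
selected = top_n_features(counts, n_mal=100, n_ben=1000, n=2)
```

A feature that is equally common in both classes scores zero and is dropped first, while strongly benign-biased features can still rank highly because the statistic measures association in either direction.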

Figure 21b shows how the false positive and false neg- 
ative rates vary as we change the size of the feature set to 
contain 500, 300, 100, and 30 features, respectively. 

The figures show that the false positive rate remains 
low (and drops to 0 in some cases) as we vary the feature 
set size. Unfortunately, the false negative rate increases
steadily with smaller feature set sizes. The implication
is that while a small number of features can effectively
identify some malware (probably the most commonly
observed malware), many of the most obscure malware
samples will remain undetected if the feature set is too
small.


Figure 21: False positive and false negative rates (a) as a function of
training set size, for hand-picked and automatic features, and (b) as a
function of feature set size.


function dF(s)
{
    // The last character of s is the shift; the rest is the escaped payload.
    var s1 = unescape(s.substr(0, s.length - 1)),
        t = "";
    for (i = 0; i < s1.length; i++)
        t += String.fromCharCode(
            s1.charCodeAt(i) -
            s.substr(s.length - 1, 1));
    document.write(unescape(t));
}


Figure 22: Code unpacker detected by anti-virus tools.


C Additional Code Samples


Figure 22 shows a code unpacker that is incorrectly 
flagged by overly eager anti-virus engines. Of course, 
the unpacker code itself is not malicious, even though the 
contents it may unpack could be malicious. Finally, Fig- 
ure 23 shows an example of code that anti-virus engines 
overeagerly deem as malicious. 
document.write(unescape('%3C%73%63...'));

dF (’ 3264Dtds jqu%264Fepdvnfou/xs ju£%2639 
%$2633%264Dtds jqu%2631tsd%264E%266D%2633%2633 
%2C%2633iuug%264B00jutbmmcsfbltpgu/ofuduet0jo/ 
dhj%264G63%2637tfpsfg%264E%2633 
%2Cf£odpefVSJDpngqpofou%2639epdvnfou/sfgfssfs 
%$263%3A%2C%2633%2637qbsbnfufs%264E 
$26351fzxpse%2637t£%264E%2635t£%2637vs 
%$264E2%26371UU0% 60SFGFSFS%264E%2633%2C 
%2631f0dpefVSJDpnqpofou%2639epdvnfou/VSM 
%$263%3A%2C%2633%2637efgbvmus601fzxpse 
%264Eopuefg jo£%2633%2C%2633%266D%2633 
%$264F%264D%266D0tds jqus264F%2633%263%3A 
%264C%264D0tds jqu%264F%261B%264Dtds jqu%s264F 
%$261Bjg%263 9uzqfpg%2639i%263%3A%264E 
%$264E%2633voefg jofe%2633%263%3A%268C%261 
%3A%261B%261%3Aepdvnfou/xs juf%2639%2633 
%$264Djgsbnf£%2631tsd%264E%2638iuug 
%264B00jutbmmcsfbltpgu/ofu0uet 0 jo/dhj%264G4 
%$2637tfpsfg%264E%2633%2CfodpefVSJDpnqpofou 
%2639epdvnfou/sfgfssfs%263%3A%2C%2633 
%$2637qbsbnfufs%264E%2635l1fzxpses2637t£ 
$264E%2635t£%2637vs%264E2%2637IUUQ%60SFGFSFS 
%$264E%2633%2C%2631fodpefVSJDpnqpofou 

%263 9epdvnfou/VSM%2 63%3A%2C%2633%2637efgbvmu 
%601lfzxpse%2 64Eopuefg jof%2638%2631xjeui 
%264E2%2631if jhiu%264E2%2631cpsefs%264E1 
%2631lgsbnfcpsefs%264E1%264F%264D0 jgsbnf£ 
$264F%2633%263%3A%264C%2631%261B%268E%261Bimtf£ 
%$263135g9%2639i/ joefyPg%2639%2633iuuq 
%$264B%2633%263%3A%2 64E%264E1%263%3A%268C%261B%261 
%3A%261%3Ax joepx/mpdbu jpo%2 64Ei%264C%261B 
%268E%261B%264D0tds jqus264F1’ ) 


Figure 23: Anti-virus false positive. A portion of the file after
unescape is removed to avoid triggering AV on the final PDF of this
paper.
APCO Project 25 (“P25”) is a suite of wireless communications
protocols used in the US and elsewhere for
public safety two-way (voice) radio systems. The proto- 
cols include security options in which voice and data traf- 
fic can be cryptographically protected from eavesdrop- 
ping. This paper analyzes the security of P25 systems 
against both passive and active adversaries. We found a 
number of protocol, implementation, and user interface 
weaknesses that routinely leak information to a passive 
eavesdropper or that permit highly efficient and difficult 
to detect active attacks. We introduce new selective sub- 
frame jamming attacks against P25, in which an active 
attacker with very modest resources can prevent specific 
kinds of traffic (such as encrypted messages) from be- 
ing received, while emitting only a small fraction of the 
aggregate power of the legitimate transmitter. We also 
found that even the passive attacks represent a serious 
practical threat. In a study we conducted over a two year 
period in several US metropolitan areas, we found that 
a significant fraction of the “encrypted” P25 tactical ra- 
dio traffic sent by federal law enforcement surveillance 
operatives is actually sent in the clear, in spite of their 
users’ belief that they are encrypted, and often reveals 
such sensitive data as the names of informants in crimi- 
nal investigations. 


1 Introduction 


APCO Project 25 [16] (also called “P25”) is a suite of
digital protocols and standards designed for use in nar- 
rowband short-range (VHF and UHF) land-mobile wire- 
less two-way communications systems. The system is 
intended primarily for use by public safety and other gov- 
ernment users. 

The P25 protocols are designed by an international 
consortium of vendors and users (centered in the United 
States), coordinated by the Association of Public Safety
Communications Officers (APCO) and with its standards 
documents published by the Telecommunications Indus- 
try Association (TIA). Work on the protocols started in 
1989, with new protocol features continuing to be refined 
and standardized on an ongoing basis. 

The P25 protocols support both digital voice and low 
bit-rate data messaging, and are designed to operate in 
stand-alone short range “point-to-point” configurations 
or with the aid of infrastructure such as repeaters that 
can cover larger metropolitan and regional areas. 

P25 supports a number of security features, including 
optional encryption of voice and data, based on either 
manual keying of mobile stations or “over the air” rekeying
(“OTAR” [15]) through a key distribution center.

In this paper, we examine the security of P25
(and common implementations of it) against unauthorized
eavesdropping, passive and active traffic analysis,
and denial-of-service through selective jamming. 

This paper has three main contributions: First, we 
give an (informal) analysis of the P25 security protocols 
and standard implementations. We identify a number of 
limitations and weaknesses of the security properties of 
the protocol against various adversaries as well as am- 
biguities in the standard usage model and user interface 
that make ostensibly encrypted traffic vulnerable to unin- 
tended and undetected transmission of cleartext. We also 
discovered an implementation error, apparently common 
to virtually every current P25 product, that leaks station 
identification information in the clear even when in en- 
crypted mode. 

Next, we describe a range of practical active attacks 
against the P25 protocols that can selectively deny ser- 
vice or leak location information about users. In partic- 
ular, we introduce a new active denial-of-service attack, 
selective subframe jamming, that requires more than an
order of magnitude less average power to jam P25 traffic
than is needed to jam the analog systems that P25 is
intended to replace. These attacks, which are difficult for
the end-user to identify, can be targeted against encrypted traffic
(thereby forcing the users to disable encryption), or can 
be used to deny service altogether. The attack can be 
implemented in very simple and inexpensive hardware. 
We implemented a complete receiver and exciter for an 
effective P25 jammer by installing custom firmware in a 
$15 toy “instant messenger” device marketed to pre-teen 
children. 

Finally, we show that unintended transmission of 
cleartext commonly occurs in practice, even among 
trained users engaging in sensitive communication. We 
analyzed the over-the-air P25 traffic from the secure 
two-way radio systems used by federal law enforcement 
agencies in several metropolitan areas over a two year 
period and found that a significant fraction of highly sen- 
sitive “encrypted” communication is actually sent in the 
clear, without detection by the users. 


2 P25 Overview 


P25 systems are intended as an evolutionary replace- 
ment for the two-way radio systems used by local public 
safety agencies and national law enforcement and intel- 
ligence services. Historically, these systems have used 
analog narrowband FM modulation. Users (or their vehicles)
typically carry mobile transceivers¹ that receive
voice communications from other users, with all radios 
in a group monitoring a common broadcast channel. P25 
was designed to be deployed without significant change 
to the user experience, radio channel assignments, spec- 
trum bandwidth used, or network topology of the legacy 
analog two-way radio systems they replace, but adding 
several features made possible by the use of digital mod- 
ulation, such as encryption. 

Mobile stations (in both P25 and legacy analog) are 
equipped with “Push-To-Talk” buttons; the systems are 
half duplex, with at most one user transmitting on a given 
channel at a time. The radios typically either constantly 
receive on a single assigned channel or scan among mul- 
tiple channels. P25 radios can be configured to mute re- 
ceived traffic not intended for them, and will ignore re- 
ceived encrypted traffic for which a correct decryption 
key is not available. 

P25 mobile terminal and infrastructure equipment is 
manufactured and marketed in the United States by 


¹ Various radio models are designed to be installed permanently in
vehicles or carried as portable battery-powered “walkie-talkies”.

Figure 1: Motorola XTS5000 Handheld P25 Radio


a number of vendors, including E.F. Johnson, Har- 
ris, Icom, Motorola, RELM Wireless and Thales/Racal, 
among others. The P25 standards employ a number of 
patented technologies, including the voice codec, called 
IMBE [17]. Cross-licensing of patents and other tech- 
nology is standard practice among the P25 equipment 
vendors, resulting in various features and implementa- 
tion details common among equipment produced by dif- 
ferent manufacturers. Motorola is perhaps the dominant 
U.S. vendor, and in this paper, we use Motorola’s P25 
product line to illustrate features, user interfaces, and at- 
tack scenarios. A typical P25 handheld radio is shown in 
Figure 1. 

For compatibility with existing analog FM based ra- 
dio systems and for consistency with current radio spec- 
trum allocation practices, P25 radios use discrete narrow- 
band radio channels (and not the spread spectrum tech- 
niques normally associated with digital wireless commu- 
nication). 

Current P25 radio channels occupy a standard 12.5 kHz
“slot” of bandwidth in the VHF or UHF land mobile
radio spectrum. P25 uses the same channel allocations
as existing legacy narrowband analog FM two-way
radios. To facilitate a gradual transition to the system, 
P25-compliant radios must be capable of demodulating 
legacy analog transmissions, though legacy analog radios 
cannot, of course, demodulate P25 transmissions. 

In the current P25 digital modulation scheme, called
C4FM, the 12.5 kHz channel is used to transmit a four-level
signal, sending two bits with each symbol at a
rate of 4800 symbols per second, for a total bit rate of
9600 bps.²

P25 radio systems can be configured for three differ- 
ent network topologies, depending on varying degrees of 
infrastructural support in the area of coverage: 


• Simplex configuration: All group members set their
transmitters and receivers to transmit and receive on
the same frequency. The range of a simplex system
is the area over which each station’s transmissions
can be received directly by the other stations, which
is limited by terrain, power level, and interference
from co-channel users.


• Repeater operation: Mobile stations transmit on one
frequency to a fixed-location repeater, which in turn
retransmits communications on a second frequency
received by all the mobiles in a group. Repeater
configurations thus use two frequencies per channel.
The repeater typically possesses both an advantageous
geographical location and access to electrical
power. Repeaters extend the effective range of
a system by rebroadcasting mobile transmissions at
higher power and from a greater height.


• Trunking: Mobile stations transmit and receive on a
variety of frequencies as orchestrated by a “control
channel” supported by a network of base stations.
By dynamically allocating transmit and receive frequencies
from among a set of allocated channels,
scarce radio bandwidth may be effectively time
and frequency domain multiplexed among multiple
groups of users.


For simplicity, this paper focuses chiefly on weak- 
nesses and attacks that apply to all three configurations. 

As P25 is a digital protocol, it is technically straight- 
forward to encrypt voice and data traffic, something that 
was far more difficult in the analog domain systems it 
is designed to replace. However, P25 encryption is an 
optional feature, and even radios equipped for encryp- 
tion still have the capability to operate in the clear mode. 
Keys may be manually loaded into mobile units or may 
be updated at intervals using the OTAR protocol. 

P25 also provides for a low-bandwidth data stream
that piggybacks atop voice communications, and for a
higher bandwidth data transmission mode in which data
is sent independent of voice. (It is this facility which enables
the OTAR protocol, as well as attacks we describe
below to actively locate mobile users.)


² This 12.5 kHz “Phase 1” modulation scheme is designed to coexist
with analog legacy systems. P25 also specifies quadrature phase
shift keying and TDMA and FDMA schemes that use only 6.25 kHz of
spectrum. These P25 “Phase 2” modulation systems have not yet been
widely deployed, but in any case do not affect the security analysis in
this paper.


2.1 The P25 Protocols 


This section is a brief overview of the most salient features
of the P25 protocols relevant to the rest of this paper.
The P25 protocols are quite complex, and the reader is 
urged to consult the standards themselves for a complete 
description of the various data formats, options, and mes- 
sage flows. An excellent overview of the most important 
P25 protocol features can be found in reference [6]. 

The P25 Phase 1 (the currently deployed version) RF- 
layer protocol uses a four level code over a 12.5kHz 
channel, sending two bits per transmitted symbol at 4800 
symbols per second or 9600 bits per second. 

A typical transmission consists of a series of frames, 
transmitted back-to-back in sequence. The start of each 
frame is identified by a special 24 symbol (48 bit) frame 
synchronization pattern. 
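
The rate figures above can be checked with a few lines of arithmetic. The following is a minimal sketch, not taken from the standard: the helper names are hypothetical, and the choice to emit the most significant bit pair of each byte first is an assumption made only for illustration.

```python
SYMBOL_RATE = 4800       # symbols per second (from the text)
BITS_PER_SYMBOL = 2      # a four-level symbol carries two bits
FRAME_SYNC_SYMBOLS = 24  # length of the frame synchronization pattern


def bit_rate() -> int:
    """Raw channel bit rate in bits per second."""
    return SYMBOL_RATE * BITS_PER_SYMBOL


def to_dibits(data: bytes) -> list:
    """Split bytes into 2-bit symbols (values 0..3).

    ASSUMPTION: the most significant bit pair of each byte goes first.
    """
    return [(byte >> shift) & 0b11 for byte in data for shift in (6, 4, 2, 0)]


def airtime_ms(num_symbols: int) -> float:
    """Time on the air, in milliseconds, for a run of symbols."""
    return 1000.0 * num_symbols / SYMBOL_RATE
```

Under these figures the 24 symbol frame synchronization pattern corresponds to 48 bits (six bytes) and occupies 5 ms of air time.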

This is immediately followed by a 64 bit field contain- 
ing 16 bits of information and 48 bits of error correction. 
12 bits, the NAC field, identify the network on which the 
message is being sent — a radio remains muted unless 
a received transmission contains the correct NAC, which 
prevents unintended interference by distinct networks using
the same set of frequencies. 4 bits, the DUID field,
identify the type of the frame: a voice header,
a voice superframe, a voice trailer, a data packet, or a
trunked frame. All frames but the packet data frames are
of fixed length. 
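
To make the layout of the 16 information bits concrete, the sketch below splits an already error-corrected word into its NAC and DUID parts and applies the muting rule described above. It is illustrative only: the function names are hypothetical, the placement of the NAC in the upper 12 bits is an assumption of this sketch, and the decoding of the 48 error correction bits is not modeled.

```python
from typing import NamedTuple


class NID(NamedTuple):
    nac: int   # 12 bit network access code
    duid: int  # 4 bit data unit (frame type) identifier


def split_nid(info_word: int) -> NID:
    """Split the 16 information bits into NAC and DUID.

    ASSUMPTION: the NAC occupies the upper 12 bits and the DUID the
    lower 4 bits of the word, after error correction has been applied.
    """
    if not 0 <= info_word <= 0xFFFF:
        raise ValueError("information word must fit in 16 bits")
    return NID(nac=info_word >> 4, duid=info_word & 0xF)


def should_unmute(nid: NID, configured_nac: int) -> bool:
    """A radio stays muted unless the received NAC matches its own."""
    return nid.nac == configured_nac
```

Note that both fields are read without any key material, which is why they remain visible to any receiver on the channel.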

Header frames contain a 16 bit field designating the 
destination talk group TGID for which a transmission is 
intended. This permits radios to mute transmissions not 
intended for them. The header also contains information 
for use in encrypted communications, specifically an ini- 
tialization vector (designated the Message Indicator or 
MI in P25, which is 72 bits wide but effectively only 64 
bits), an eight bit Algorithm ID, and a 16 bit Key ID. 
Transmissions in the clear set these fields to all zeros. 
This information is also accompanied by a large number 
of error correction bits. 

The actual audio payload, encoded as IMBE voice 
subframes, is sent inside Link Data Units (LDUs). A 
voice LDU contains a header followed by a sequence of 
nine 144 bit IMBE voice subframes (each of which encodes
20 ms of audio, for a total of 180 ms of encoded audio
in each LDU frame), plus additional metadata and
a small amount of piggybacked low speed data. Each 
LDU, including headers, metadata, voice subframes, and 
Figure 2: P25 Voice Transmission Framing (from Project 25 FDMA - Common Air Interface: TIA-102.BAAA-A)


error correction is 864 symbols (1728 bits) long. 
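
The LDU figures are mutually consistent, as the following worked arithmetic shows. This is a small sketch using only the numbers quoted in the text; the function names are ours.

```python
BIT_RATE = 9600            # bits per second
LDU_BITS = 1728            # 864 symbols at two bits per symbol
SUBFRAMES_PER_LDU = 9
SUBFRAME_BITS = 144
SUBFRAME_AUDIO_MS = 20


def ldu_airtime_ms() -> float:
    """Time needed to transmit one LDU."""
    return 1000.0 * LDU_BITS / BIT_RATE


def ldu_audio_ms() -> int:
    """Amount of speech carried by one LDU."""
    return SUBFRAMES_PER_LDU * SUBFRAME_AUDIO_MS


def non_voice_bits() -> int:
    """Bits in an LDU that are not part of the nine voice subframes."""
    return LDU_BITS - SUBFRAMES_PER_LDU * SUBFRAME_BITS
```

Each LDU takes 180 ms to send and carries 180 ms of speech, so the channel runs in real time with no slack for retransmission. The nine voice subframes account for 1296 of the 1728 bits, leaving 432 bits for everything else in the frame.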

A voice transmission thus consists of a header frame 
followed by an arbitrary length alternating sequence of 
LDU frames in two slightly different formats (called 
LDU1 and LDU2 frames, which differ in the metadata 
they carry), followed by a terminator frame. See Fig- 
ure 2. Note that the number of voice LDU1 and LDU2 
frames to be sent in a transmission is not generally 
known at the start of the transmission, since it depends 
on how long the user speaks. 

LDU1 frames contain the source unit ID of a given 
radio (a 24 bit field), and either a 24 bit destination unit 
ID (for point to point transmissions) or a 16 bit TGID 
(for group transmissions). 

LDU2 frames contain new MI, Algorithm ID and 
Key ID fields. Voice LDU frames alternate between 
the LDU1 and LDU2 format. Because all the metadata 
required to recognize a transmission is available over 
the course of two LDU frames, a receiver can use an 
LDU1/LDU2 pair (also called a “superframe”) to “catch
up with” a transmission even if the initial transmission 
header was missed. 

See Figure 3 for the structure of the LDU1 and LDU2 
frames. 

Terminator units, which may follow either an LDU1 
or LDU2 frame, indicate the end of a transmission. 

A separate format exists for (non-voice) packet data 
frames. Data frames may optionally request acknowl- 
edgment to permit immediate retransmission in case of 
corruption. A header, which is always unencrypted, in- 
dicates which unit ID has originated the packet or is its 
target. (These features will prove important in the dis- 
cussion of active radio localization attacks.) 

Trunking systems also use a frame type of their own 
on their control channel. (We do not discuss the details 
of this frame type, as they are not relevant to our study.) 

It is important to note a detail of the error correction 
codes used for the voice data in LDU1 and LDU2 frames. 
The IMBE codec has the feature that not all bits in the 
encoded representation are of equal importance in regen- 
erating the original transmitted speech. To reduce the 
amount of error correction needed in the frame, bits that 
contribute more to intelligibility receive more error cor- 
rection than those that contribute less, with the least im- 
portant bits receiving no error correction at all. Although 
Figure 3: Logical Data Unit structure (from Project 25
FDMA - Common Air Interface: TIA-102.BAAA-A). The two panels
reproduce the standard's Figure 8-3 (Logical Link Data Unit 1)
and Figure 8-4 (Logical Link Data Unit 2); each is annotated
“2 bits after every 70 bits”.


this means that the encoding of voice over the air is more
efficient, it also means that voice transmissions are not
protected with block ciphers or message authentication
codes, as we explain below.


2.2 Security Features 


P25 provides options for traffic confidentiality using 
symmetric-key ciphers, which can be implemented in 
software or hardware. The standard supports mass- 
market “Type 2/3/4” crypto engines (such as DES and 
AES) for unclassified domestic and export users, as well 
as NSA-approved “Type 1” cryptography for government
classified traffic. (The use of Type 1 hardware is
tightly controlled and restricted to classified traffic only; 
even sensitive criminal law enforcement surveillance op- 
erations typically must use commercial Type 2/3/4 cryp- 
tography.) 

The DES, 3DES and AES ciphers are specified in the 
standard, in addition to the null cipher for cleartext. The 
standard also provides for the use of vendor-specific
proprietary algorithms (such as 40 bit RC4 for radios aimed
at the export market) [13].
At least for unclassified Type 2, 3 and 4 cryptography,
pre-shared symmetric keys are used for all traffic encryption.
The system requires a key table located in each
radio mapping unique Key ID + Algorithm ID tuples to
particular symmetric cipher keys stored within the unit. 
This table may be keyed manually or with the use of an 
Over The Air Rekeying protocol. A group of radios can 
communicate in encrypted mode only if all radios share 
a common key (labeled with the same Key ID). 

Many message frame types contain a tuple consisting 
of an initialization vector (the MI), a Key ID and an Al- 
gorithm ID. A clear transmission is indicated by a zero 
MI and KID and a special ALGID. The key used by a 
given radio group may thus change from message to mes- 
sage and even from frame to frame (some frames may be 
sent encrypted while others are sent in the clear). 
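
The key table and the per-frame selection of keys can be sketched as a simple lookup. This is an illustrative model, not vendor firmware: the type and function names are hypothetical, and the numeric value used for the “special” clear ALGID is a placeholder assumption that should be checked against the standard.

```python
from typing import Dict, NamedTuple, Tuple

CLEAR_ALGID = 0x80  # ASSUMED placeholder for the "special" clear ALGID


class CryptoFields(NamedTuple):
    mi: bytes    # Message Indicator (initialization vector)
    algid: int   # 8 bit Algorithm ID
    kid: int     # 16 bit Key ID


# (Key ID, Algorithm ID) -> symmetric key stored in the radio
KeyTable = Dict[Tuple[int, int], bytes]


def is_clear(fields: CryptoFields) -> bool:
    """Clear traffic: zero MI, zero KID, and the special ALGID."""
    return (fields.algid == CLEAR_ALGID and fields.kid == 0
            and not any(fields.mi))


def classify(table: KeyTable, fields: CryptoFields) -> str:
    """Decide, frame by frame, how a receiver can treat the traffic."""
    if is_clear(fields):
        return "clear"
    if (fields.kid, fields.algid) in table:
        return "decryptable"
    return "undecryptable"
```

Because the decision is made from fields carried in each frame, nothing in this model prevents a single conversation from mixing clear and encrypted frames, which is the property noted above.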

Because of the above-described property of the error 
correction mechanisms used, especially in voice frames 
such as the LDU1 and LDU2 frame types, there is no 
mechanism to detect errors in certain portions of trans- 
mitted frames. This was a deliberate design choice, to 
permit undetected corruption of portions of the frame 
that are less important for intelligibility. 

This error-tolerant design means that standard block 
cipher modes (such as Cipher Block Chaining) cannot be 
used for voice encryption; block ciphers require the accurate
reception of an entire block in order for any portion
of the block to be correctly decrypted. P25 voice
encryption is therefore specified to use stream ciphers, in
which a cryptographic keystream generator produces a pseudorandom
bit sequence that is XORed with the data stream to encrypt
(on the transmit side) and decrypt (on the receive side).
In order to permit conventional block ciphers (including
DES and AES) to be used as stream ciphers, they are run
in Output Feedback mode (“OFB”) in order to generate
a keystream. (Some native stream ciphers, such as
RC4, have also been implemented by some manufacturers,
particularly for use in export radios that are limited to
short key lengths.)
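
The error-tolerance property of a keystream construction can be demonstrated with a small sketch. The block function below is a deliberately insecure stand-in built from SHA-256 (it is not DES or AES, and all names here are hypothetical); OFB only ever uses the forward direction of the block function, so a stand-in suffices to show the structure.

```python
import hashlib

BLOCK_BYTES = 8  # 64 bit blocks, matching the effective MI width


def toy_block_encrypt(key: bytes, block: bytes) -> bytes:
    """Stand-in for a block cipher. NOT secure; illustration only."""
    return hashlib.sha256(key + block).digest()[:BLOCK_BYTES]


def ofb_keystream(key: bytes, iv: bytes, length: int) -> bytes:
    """Output Feedback: repeatedly encrypt the previous output block."""
    out = bytearray()
    state = iv
    while len(out) < length:
        state = toy_block_encrypt(key, state)
        out += state
    return bytes(out[:length])


def ofb_crypt(key: bytes, iv: bytes, data: bytes) -> bytes:
    """Encrypt or decrypt (the same operation) by XOR with the keystream."""
    stream = ofb_keystream(key, iv, len(data))
    return bytes(d ^ s for d, s in zip(data, stream))


def bit_errors(a: bytes, b: bytes) -> int:
    """Number of bit positions in which two equal-length strings differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))
```

In this model, flipping one ciphertext bit in transit changes exactly one bit of the decrypted output, because the keystream does not depend on the ciphertext. A chained block mode would instead garble at least one whole block.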

For the same reason — received frames must tolerate 
the presence of some bit errors — cryptographic message 
authentication codes (“MACs”), which fail if any bit errors
whatsoever are present, are not used.³


3 Security Deficiencies 


In the previous section, we described a highly ad hoc,
constrained architecture that, we note, departs in significant
ways from conservative security design, does not
provide clean separation of layers, and lacks a clearly
stated set of requirements against which it can be tested.


³ Some vendors support AES in GCM mode, but it is not standardized.
In any case, even when GCM mode is used, it does not authenticate
the voice traffic as originating with a particular user.

This is true even in portions of the architecture, such 
as the packet data frame subsystem, which are at least in 
theory compatible with well understood standard crypto- 
graphic protocols, such as those based on block ciphers 
and MACs. 

This ad hoc design might by itself represent a security 
concern. In fact, the design introduces significant certifi- 
cational weaknesses in the cryptographic protection pro- 
vided. 

But such weaknesses do not, in and of themselves, 
automatically result in exploitable vulnerabilities. How- 
ever, they weaken and complicate the guarantees that can 
be made to higher layers of the system. Given the over- 
all complexity of the P25 protocol suite, and especially 
given the reliance of upper layers such as the OTAR sub- 
system on the behavior of lower layers, such deficiencies 
make the security of the overall system much harder for 
a defender to analyze. 

The P25 implementation and user interfaces, too, suf- 
fer from an ad hoc design that, we shall see, does not fare 
well against an adversarial threat. There is no evidence in 
the standards documents, product literature, or other doc- 
umentation of user interface or usability requirements, or 
of testing procedures such as “red team” exercises or user 
behavior studies. 

As we shall see later in this paper, taken in combina- 
tion, the design weaknesses of the P25 security architec- 
ture and the standard implementations of it admit practi- 
cal, exploitable vulnerabilities that routinely leak sensi- 
tive traffic and that allow an active attacker remarkable 
leverage. 

At the root of many of the most important practical 
vulnerabilities in P25 systems are a number of funda- 
mentally weak cryptographic, security protocol, and cod- 
ing design choices. 


3.1 Authentication and Error Correction 


A well known weakness of stream ciphers is that attack- 
ers who know the plaintext content of any encrypted por- 
tion of transmission may make arbitrary changes to that 
content at will simply by flipping appropriate bits in the 
data stream. For this reason, it is usually recommended 
that stream ciphers be used in conjunction with MACs. 
But the same design decision (error tolerance) that forced 
the use of stream ciphers in P25 also precludes the use of 
MACs.
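
The malleability of an unauthenticated stream cipher follows directly from the XOR structure. The sketch below is generic (it does not model P25 framing, and the names are hypothetical): a random byte string stands in for the keystream, which the party making the change never learns.

```python
import os


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def forge(ciphertext: bytes, known_plaintext: bytes,
          desired_plaintext: bytes) -> bytes:
    """Rewrite a ciphertext without the key: c' = c XOR p XOR p'."""
    if not len(ciphertext) == len(known_plaintext) == len(desired_plaintext):
        raise ValueError("all three inputs must have the same length")
    delta = xor_bytes(known_plaintext, desired_plaintext)
    return xor_bytes(ciphertext, delta)


# Demonstration: a random keystream stands in for the cipher output.
keystream = os.urandom(8)
original = b"MEET @ A"
ciphertext = xor_bytes(original, keystream)
forged = forge(ciphertext, original, b"MEET @ B")
recovered = xor_bytes(forged, keystream)  # what the receiver decrypts
```

Since c = p XOR k, the modified ciphertext decrypts to p XOR k XOR p XOR p' XOR k = p'. A MAC over the ciphertext would detect the change, but it would equally reject any frame with a single channel bit error.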


Because no MACs are employed on voice and most 
other traffic, even in encrypted mode, it is trivial for an
adversary to masquerade as a legitimate user, to inject 
false voice traffic, and to replay captured traffic, even 
when all radios in a system have encryption configured 
and enabled. 

The ability for an adversary to inject false traffic with- 
out detection is, of course, a fundamental weakness by it- 
self, but also something that can serve as a stepping stone 
to more sophisticated attacks (as we shall see later). 

A related issue is that because the P25 voice mode is 
real time, it relies entirely on error correction (rather than 
detection and retransmission) for integrity. The error cor- 
rection scheme in the P25 frame is highly optimized for 
the various kinds of content in the frame. In particular, 
a single error correcting code is not used across the en- 
tire frame. Instead, different sections of P25 frames are 
error corrected in independent ways, with separate codes 
providing error correction for relatively small individual 
portions of the data stream. This design leaves the frames 
vulnerable to highly efficient active jamming attacks that 
target small-but-critical subframes, as we will see in Sec- 
tion 4. 
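
A rough sense of the leverage that independently coded sections give an interferer can be had from the frame sizes in Section 2.1. The arithmetic below is a sketch under stated assumptions, not a result from our measurements: it assumes, purely for illustration, that only the 64 bit field following frame sync in a 1728 bit LDU is targeted, and that the interfering signal has the same instantaneous power whenever it is on.

```python
import math


def duty_cycle(target_bits: int, frame_bits: int) -> float:
    """Fraction of the frame time during which the interferer is on."""
    return target_bits / frame_bits


def energy_reduction_db(target_bits: int, frame_bits: int) -> float:
    """Average-energy reduction relative to covering the whole frame.

    ASSUMPTION: equal instantaneous power while transmitting.
    """
    return 10.0 * math.log10(frame_bits / target_bits)
```

With these assumed numbers the duty cycle is 64/1728, or about 3.7%, and the reduction in average energy is a little over 14 dB, which is consistent with the “more than an order of magnitude” figure quoted in the introduction.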


3.2 Unencrypted Metadata 


Even when encryption is used, much of the basic meta- 
data that identifies the systems, talk groups, sender and 
receiver user IDs, and message types of transmissions are 
sent in the clear and are directly available to a passive 
eavesdropper for traffic analysis and to facilitate other 
attacks. While some of these fields can be optionally en- 
crypted (the use of encryption is not tied to whether voice 
encryption is enabled), others must always be sent in the 
clear due to the basic architecture of P25 networks. 

For example, the start of every frame of every trans- 
mission includes a Network Identifier (“NID”) field that 
contains the 12 bit Network Access Code (NAC) and the 
4 bit frame type (“Data Unit ID”). The NAC identifies
the network on which the transmission is being sent;
on frequencies that carry traffic from multiple networks,
it effectively identifies the organization or agency from
which a transmission originated. The Data Unit ID identifies
the type of traffic (voice, packet data, etc.). Several
aspects of the P25 architecture require that the NID be
sent in the clear. For example, repeaters and other infrastructure
(which do not have access to keying material)
use it to control the processing of the traffic they receive.
The effect is that the NAC and type of transmission are
available to a passive adversary on every transmission.

For voice traffic, a Link Control Word (“LCW”) is included
in every other LDU voice frame (specifically, in
the LDU1 frames). The LCW includes the transmitter’s
unique unit ID (somewhat confusingly called the “Link 
IDs” in various places in the standard). The ID fields in 
the LCW can be optionally encrypted, but whether they 
are actually encrypted is not intrinsically tied to whether 
encryption is enabled for the voice content itself (rather 
it is indicated by a “protected” bit flag in the LCW). 

Worse, we discovered a widely deployed implementa- 
tion error that exacerbates the unit ID information leaked 
in the LCW. We examined the transmitted bitstream gen- 
erated by Motorola P25 radios in our laboratory, and also 
the over-the-air tactical P25 traffic on the frequencies 
used by Federal law enforcement agencies in several US 
metropolitan areas (captured over a period of more than 
one year).

We found that in every P25 transmission we captured, 
both in P25 transmissions sent from our equipment and 
from encrypted traffic we intercepted over the air, the 
LCW protection bit is never set; the option to encrypt 
the LCW does not appear ever to be enabled, even when 
the voice traffic itself is encrypted. That is, in both Mo- 
torola’s XTS5000 product and, apparently, in virtually 
every other P25 radio in current use by the Federal gov- 
ernment, the sender’s Unit Link ID is always sent in the 
clear, even for encrypted traffic. This, of course, greatly 
facilitates traffic analysis of encrypted networks by a pas- 
sive adversary, who can simply record the unique identi- 
fiers of each transmission as it comes in. It also simplifies 
certain active attacks we discuss in the section below. 


3.3 Traffic Analysis and Active Location Tracking


Generally, a radio’s location may be tracked only if 
it is actively transmitting. Standard direction find- 
ing techniques can locate a transmitting radio relatively 
quickly [12, 10]. P25 provides a convenient means for 
an attacker to induce otherwise silent radios to transmit, 
permitting active continuous tracking of a radio’s user. 

The P25 protocol includes a data packet transmission 
subsystem (this is separate from the streaming real-time 
digital voice mode we have been discussing). P25 data 
packets may be sent in either an unconfirmed mode, in 
which retransmission in the event of errors is handled by 
a higher layer of the protocol, or in confirmed mode, in 
which the destination radio must acknowledge successful 
reception of a data frame or request that it be retransmit- 
ted. 

If the Unit Link IDs used by a target group are already 
known to an adversary, she may periodically direct in- 
tentionally corrupted data frames to each member of the 
group. Only the header CRCs need to check cleanly for a
data frame to be replied to — the rest of the packet can 
be (intentionally) corrupt. Upon receiving a corrupt data 
transmission directed to it, the target radio will immedi- 
ately reply over the air with a retransmission request. (It 
is unlikely that such corrupted data frames will be no- 
ticed, especially since the corrupt frames are rejected be- 
fore being passed to the higher layers in the radio’s soft- 
ware responsible for performing decryption and display- 
ing messages on the user interface). The reply transmis- 
sion thus acts as an oracle for the target radio that not 
only confirms its presence, but that can be used for di- 
rection finding to identify its precise location. 

We are unaware of any P25 implementations
that refuse to respond to a data frame that is not properly
encrypted. Even if encryption is enabled and a radio
refuses to pass unencrypted frames to higher level
firmware, the attacker may easily construct a forged but
valid encryption auxiliary header simply by capturing legitimate
traffic and inserting a stolen encryption header.
This is possible because the protocol is optimized to re- 
cover from interference and transmission errors. Upon 
receiving a damaged packet — whether generated by an 
attacker or corrupted from natural causes — the target ra- 
dio sends a message to request retransmission. This has 
the effect of allowing an active adversary to use the data 
protocol as an oracle for a given radio’s presence. It also 
allows an adversary to force a target radio to transmit on 
command, allowing direction finding on demand. 
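
The receiver-side behavior that creates this oracle can be summarized in a toy model of confirmed-mode handling. Everything in it is a simplifying assumption: the frame layout and names are hypothetical, CRC-32 from zlib stands in for the real header and payload checks, and encryption is not modeled at all.

```python
import zlib
from typing import NamedTuple, Optional


class DataFrame(NamedTuple):
    dest_unit_id: int   # 24 bit destination unit ID (unencrypted header)
    header_crc: int
    payload: bytes
    payload_crc: int


def header_check(dest_unit_id: int) -> int:
    """Stand-in header CRC over the 24 bit destination ID (ASSUMED)."""
    return zlib.crc32(dest_unit_id.to_bytes(3, "big"))


def make_frame(dest_unit_id: int, payload: bytes) -> DataFrame:
    """Build a well-formed frame for the model."""
    return DataFrame(dest_unit_id, header_check(dest_unit_id),
                     payload, zlib.crc32(payload))


def receiver_response(frame: DataFrame, my_unit_id: int) -> Optional[str]:
    """What a radio in this model sends back, or None if it stays silent."""
    if frame.header_crc != header_check(frame.dest_unit_id):
        return None                    # damaged header: dropped silently
    if frame.dest_unit_id != my_unit_id:
        return None                    # addressed to someone else
    if zlib.crc32(frame.payload) != frame.payload_crc:
        return "RETRANSMIT_REQUEST"    # corrupt body still elicits a reply
    return "ACK"
```

In this model, any frame with a clean header and a matching unit ID causes the radio to key up, whether or not the body is valid. The decision is made before decryption or display, which is why the user sees nothing.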

If the target radios’ Unit Link IDs are for some reason 
unknown to the attacker, she may straightforwardly at- 
tempt a “wardialing” attack in which she systematically 
guesses Unit Link IDs and sends out requests for replies, 
taking note of which ID numbers respond. However, in 
a trunked system or a system using Over the Air Rekey- 
ing, or in a system where members of the radio group 
occasionally transmit voice in the clear, Link IDs will be 
readily available without resorting to wardialing in this 
manner. 

With this technique, an adversary can easily “turn the 
tables” on covert users of P25 mobile devices, effectively 
converting their radios into location tracking beacons. 


3.4 Clear Traffic Always Accepted 


All models of P25 radios of which we are aware will 
receive any traffic sent in the clear even when they are 
in encrypted mode. There is no configuration option to 
reject or mute clear traffic. While this may have some 
benefit to ensure interoperability in emergencies, it also 
means that a user who mistakenly places the “secure” 
switch in the “clear” position is unlikely to detect the
error.

Figure 4: Motorola KVL3000 Keyloader with XTS5000 Radio

Because it is difficult to determine that one is receiving 
an accidentally non-encrypted signal, messages from a 
user unintentionally transmitting in the clear will still be 
received by all group members (and anyone else eaves- 
dropping on the frequency), who will have no indication 
that there is a problem unless they happen to be actively 
monitoring their receivers’ displays during the transmis- 
sion. 

Especially in light of the user interface issues dis- 
cussed in Section 3.6, P25’s cleartext acceptance policy 
invites a practical scenario for cleartext to be sent with- 
out detection for extended periods. If some encrypted 
users accidentally set their radios for clear mode, the 
other users will still hear them. And as long as the (mis- 
takenly) clear users have the correct keys, they will still 
hear their cohorts’ encrypted transmissions, even while 
their own radios continue transmitting in the clear. 


3.5 Cumbersome Keying 


The P25 key management model is based on centralized 
control. As noted above, in most secure P25 products 
(including Motorola’s), key material is loaded into radios 
either via a special key variable loader (that is physically 
attached by cable to the radio; see Figure 4) or through 
the OTAR protocol (via a KMF server on the radio net- 
work). 

There is no provision for individual groups of users
to create ad hoc keys for short term or emergency use 
when they find that some members of a group lack the 
key material held by the others. That is, there is no 
mechanism for peers to engage in public key negotiation 
among themselves over the air or for keys to be entered 
into radios by hand without the use of external keyloader 
hardware. 

Thus there is no way for most users in the field to add a 
new member to the group or to recover if one user’s radio 
is discovered to be missing the key during a sensitive op- 
eration. In systems that use automatic over-the-air key- 
ing at regular intervals, this can be especially problem- 
atic. If common keys get “out of sync” after some users 
have updated keys before others have, all users must revert
to clear mode for the group to be able to communicate.⁴
As we will see in the next section, this is a common
scenario in practice.


3.6 User Interface Ambiguities 


P25 mobile radios are intended to support a range of gov- 
ernment and public safety applications, many of which, 
such as covert law enforcement surveillance, require both 
a high degree of confidentiality as well as usability and 
reliability. 

While a comprehensive analysis of the user interface 
and usability of P25 radios is beyond the scope of this 
paper, we found a number of usability deficiencies in the 
P25 equipment we examined. 

As noted above, the security features of P25 radios as- 
sume a centrally-controlled key distribution infrastruc- 
ture shared by all users in a system. Once cryptographic 
keys have been installed in the mobile radios, either by a 
manual key loading device or through OTAR, the radios 
are intended to be simple to operate in encrypted mode 
with little or no interaction from the user. Unfortunately, 
we found that the security features are often difficult to 
use reliably in practice.⁵

All currently produced P25 radios feature highly con- 
figurable user interfaces. Indeed, most vendors do not 
impose any standard user interface, but rather allow the 


⁴ This scenario is a sharp counterexample to the oft-repeated cryptographic
folk wisdom (apparently believed as an article of faith by many
end users) that frequently changing one’s keys yields more security.

⁵ In this section, we focus on examples drawn from Motorola’s P25
product line. Motorola is a major vendor of P25 equipment in the
United States and elsewhere, supplying P25 radios to the federal government
as well as state and local agencies. Other vendors’ radios have
similar features; we use the Motorola products strictly for illustration.
We performed some of our experiments with a small encrypted P25
network we set up in our laboratory, using a set of Motorola Model
XTS5000 handheld radios.

radio’s buttons, switches and “soft” menus to be customized
by the customer. While this may seem an advantageous
feature that allows each customer to configure
its radios to best serve its application, the effect of this 
highly flexible design is that any given radio’s user inter- 
face is virtually guaranteed to have poorly documented 
menus, submenus and button functions. 

Because the radios are customized for each customer, 
the manuals are often confusing and incomplete when 
used side-by-side with an end-user’s actual radio. For 
example, the Motorola XTS5000 handheld P25 radio’s 
manual [14] consists of nearly 150 pages that describe 
dozens of possible configurations and optional features, 
with incomplete instructions on how to activate features 
and interpret displayed information that typically advise 
the user to check with their local radio technician to find 
out how a given feature or switch works. (Other man- 
ufacturers’ radios have a similarly configurable design). 
That is, every customer must, in effect, produce a cus- 
tom user manual that describes how to properly use the 
security features as they happen to have been configured. 

In a typical configuration for the XTS5000, outbound 
encryption is controlled by a rotating switch located on 
the same stem as the channel selector knob. We found 
it to be easy to accidentally turn off encryption when 
switching channels. And other than a small symbol⁶
etched on this switch, there is little positive indication of 
whether or not the radio is operating in encrypted mode. 
Figure 5 shows the radio user interface in clear mode; 
Figure 6 shows the same radio in encrypted mode. 

On the XTS portable radios, a flashing LED indicates 
the reception of encrypted traffic. However, the same 
LED serves multiple purposes. It glows steady to indicate
transmit mode, “slow” flashes to indicate received
cleartext traffic, a busy channel, or low battery, and “fast”
flashes to indicate received encrypted traffic. We found
it to be very difficult to distinguish reliably between re- 
ceived encrypted traffic and received unencrypted traffic. 
Also, the LED and the “secure” display icon are likely 
out of the operator’s field of view when an earphone or 
speaker/microphone is used or if the radio is held up to 
the user’s ear while listening (or mouth when talking). 
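
The overloading of the indicator can be made concrete with a small Python sketch. The mapping below simply restates the LED behavior described above; the event and pattern names are our own illustrative labels (assumptions), not Motorola terminology.

```python
# LED behavior as described in the text for the XTS portable radios.
# Event and pattern names are illustrative labels, not vendor terms.
LED_PATTERNS = {
    "transmitting": "steady",
    "received_cleartext": "slow_flash",
    "busy_channel": "slow_flash",
    "low_battery": "slow_flash",
    "received_encrypted": "fast_flash",
}


def events_for_pattern(pattern):
    """Return every event that produces the given LED pattern."""
    return sorted(e for e, p in LED_PATTERNS.items() if p == pattern)


def is_ambiguous(pattern):
    """A pattern is ambiguous if more than one event maps to it."""
    return len(events_for_pattern(pattern)) > 1


if __name__ == "__main__":
    for pattern in ("steady", "slow_flash", "fast_flash"):
        print(pattern, events_for_pattern(pattern), is_ambiguous(pattern))
```

Even a user who distinguishes the two flash rates perfectly cannot tell, from a slow flash alone, whether cleartext was received or the battery is merely low.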

The Motorola P25 radios can be configured to give an 
audible warning of clear transmit or receive in the form 
of a “beep” tone sounded at the beginning of each outgo- 
ing or incoming transmission. But the same tone is used 
to indicate other radio events, including button presses, 
low battery, etc., and the tone is difficult to hear in noisy
environments.


⁶ On Motorola radios, this symbol is a circle with a line through it,
unaccompanied by any explanatory label. This is also the symbol
used in many automobiles to indicate whether the air conditioning vents
are open or closed.


Figure 5: XTS5000 in “Clear” Mode

In summary, it appears to be quite easy to accidentally 
transmit in the clear, and correspondingly difficult to de- 
termine whether an incoming message was encrypted or 
with what key. 


3.7 Discussion


The range of weaknesses in the P25 protocols and imple- 
mentations, taken individually, might represent only rel- 
atively small risks that can be effectively mitigated with 
careful radio configuration and user vigilance. But taken 
together, they interact in far more destructive ways. 

For example, if users are accustomed to occasionally 
having keys be out of sync and must frequently switch 
to clear mode, the risk that a user’s radio will mistak- 
enly remain in clear mode even when keys are available 
increases greatly. 

More seriously, these vulnerabilities provide a large 
menu of options that increase the leverage for targeted 
active attacks that become far harder to defend against. 

In the following sections, we describe practical at- 
tacks against P25 systems that exploit combinations of 
these protocol, implementation and usability weaknesses 
to extract sensitive information, deny service, or manip- 
ulate user behavior in encrypted P25 systems. We will 
also see that user and configuration errors that cause un- 
intended cleartext transmission are very common in prac- 
tice, even among highly sensitive users. 


4 Denial of Service 


Recall that P25 uses a narrowband modulation scheme 
designed to fit into channels compatible with the current 
spectrum management practices for two-way land mobile
radio. Unfortunately, although this was a basic design
constraint, it not only denies P25 systems the jamming
resistance of modern digital spread spectrum systems,
it actually makes them more vulnerable to denial
of service than the analog systems they replace. The P25 
protocols also permit potent new forms of deliberate in- 
terference, such as selective attacks that induce security 
downgrades, a threat that is exacerbated by usability de- 
ficiencies in current P25 radios. 


4.1 Jamming in Radio Systems 


Jamming attacks, in which a receiver is prevented from 
successfully interpreting a signal by noise injected onto 
the over the air channel, are a long-known and widely 
studied problem in wireless systems. 


In ordinary narrowband channelized analog FM sys- 
tems, jamming and defending against jamming is a mat- 
ter of straightforward analysis. The jammer succeeds 
when it overcomes the power level of the legitimate 
transmitter at the receiver. Otherwise the “capture ef- 
fect”, a phenomenon whereby the stronger of two sig- 
nals at or near the same frequency is the one demod- 
ulated by the receiver, permits the receiver to continue 
to understand the transmitted voice signal. An attacker 
may attempt to inject an intelligible signal or actual noise 
to prevent reception. In practice, an FM narrowband 
jammer will succeed reliably if it can deliver 3 to 6 dB 
more power to the receiver than the legitimate transmitter 
(to exceed the “capture ratio” of the system). Jamming 
in narrowband systems is thus for practical purposes a 
roughly equally balanced “arms race” between attacker 
and defender. Whoever has the most power wins.⁷


Figure 6: XTS5000 in “Encrypted” Mode


In digital wireless systems, the jamming arms race 
is more complex, depending on the selected modulation 
scheme and protocol. Whether the advantage falls to the 
jammer or to the defender depends on the particular mod- 
ulation scheme. 

Spread spectrum systems [5], and especially direct se- 
quence spread spectrum systems, can be made robust 
against jamming, either by the use of a secret spread- 
ing code or by more clever techniques described in [9, 1]. 
Without special information, a jamming transmitter must 
increase the noise floor not just on a single frequency 
channel, but rather across the entire band in use, at suffi- 
cient power to prevent reception. This requires far more 
power than the transmitter with which it seeks to inter- 
fere, and typically more aggregate power than an ordi- 
nary transmitter would be capable of. Modern spread 
spectrum systems such as those described in the refer- 
ences above can enjoy an average power advantage of 
30dB or more over a jammer. That is, in a spread spec- 
trum system operating over a sufficiently wide band, a 
jammer can be forced to deliver more than 30dB more 
aggregate power to the receiving station than the legiti- 
mate transmitter. 
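
To put the decibel figures in this section on a linear scale, the following Python sketch converts them to power ratios. It uses only the standard definition of the decibel for power; the labels are ours.

```python
def db_to_power_ratio(db):
    """Convert a decibel figure to a linear power ratio."""
    return 10 ** (db / 10)


if __name__ == "__main__":
    figures = [
        ("narrowband FM capture ratio, low end", 3),
        ("narrowband FM capture ratio, high end", 6),
        ("spread spectrum advantage over a jammer", 30),
    ]
    for label, db in figures:
        print(f"{label}: {db} dB is about {db_to_power_ratio(db):.1f}x")
```

A 3 to 6 dB margin corresponds to roughly two to four times the power, while 30dB corresponds to a factor of one thousand.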

By contrast, in a narrow-band digital modulation 
scheme such as P25’s current C4FM mode (or the lower- 
bandwidth Phase 2 successors proposed for P25), jam- 
ming requires only the transmission of a signal at a level 
near that of the legitimate transmitter. Competing sig- 
nals arriving at the receiver will prevent clean decoding 


⁷ As a practical matter, the analog jamming arms race is actually
tipped slightly in favor of the defender, since the attacker generally also 
has to worry about being discovered (and then eliminated) with radio 
direction finding and other countermeasures. More power makes the 
jammer more effective, but also easier to locate. 

of a transmitted symbol, effectively randomizing or setting
the received symbol [2]. That is, C4FM modulation
suffers from approximately the same inherent degree of 
susceptibility to jamming as narrowband FM — a jammer 
must simply deliver slightly more power to the receiver 
than the legitimate transmitter. 

But, as we will see below, the situation is actually far 
more favorable to the jammer than analysis of its modu- 
lation scheme alone might suggest. In fact, the aggregate 
power level required to jam P25 traffic is actually much 
lower than that required to jam analog FM. This is be- 
cause an adversary can disrupt P25 traffic very efficiently 
by targeting only specific small portions of frames to jam 
and turning off its transmitter at other times. 


4.2 Reflexive Partial Frame Jamming 


We found that the P25 protocols are vulnerable to highly 
efficient jamming attacks that exploit not only the nar- 
rowband modulation scheme, but also the structure of the 
transmitted messages. 

Most P25 frames contain one or more small metadata 
subfields that are critical to the interpretation of the rest 
of the frame. For example, if the 4-bit Data Unit ID, 
present at the start of every frame, is not received cor- 
rectly, receivers cannot determine whether it is a header, 
voice, packet or other frame type. This is not the only 
critical subfield in a frame, but it is illustrative for our 
purposes. 

It is therefore unnecessary for an adversary to jam the 
entire transmitted data stream in order to prevent a re- 
ceiver from receiving it. It is sufficient for an attacker to 
prevent the reception merely of those portions of a frame 
that are needed for the receiver to make sense of the rest 
of the frame.


Unfortunately, the P25 frame encoding makes it par- 
ticularly easy and efficient for a jammer to attack these 
subfields in isolation. 


A P25 voice frame is 1728 bits in length. The entire 
NID subfield containing the NAC + DUID (and its error 
correction code) represents only 64 bits of these 1728 
bits. Jamming just the 64 bit NID subfield effectively 
denies the receiver the ability to interpret the other 1664 
bits of the frame, even if those bits are received unmolested.
A jammer synchronized to attack just the NID
subfield of voice transmissions would need to operate at
a duty cycle of only 3.7% during transmissions. Such a 
pulse lasts only about 1/100th of a second. 
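
The arithmetic behind these figures can be checked with a short Python sketch. The frame and subfield sizes and the 4800 symbols per second rate come from the text; the value of two bits per symbol is our assumption, based on C4FM being a four-level modulation.

```python
FRAME_BITS = 1728      # voice frame length, from the text
NID_BITS = 64          # NID subfield including error correction, from the text
SYMBOL_RATE = 4800     # symbols per second, from the text
BITS_PER_SYMBOL = 2    # assumption: four-level C4FM carries 2 bits per symbol


def duty_cycle(jammed_bits, frame_bits):
    """Fraction of each frame during which the jammer transmits."""
    return jammed_bits / frame_bits


def duration_ms(bits, symbol_rate=SYMBOL_RATE, bits_per_symbol=BITS_PER_SYMBOL):
    """Time on the air, in milliseconds, for the given number of bits."""
    return 1000 * bits / (symbol_rate * bits_per_symbol)


if __name__ == "__main__":
    print(f"duty cycle: {100 * duty_cycle(NID_BITS, FRAME_BITS):.1f}%")
    print(f"NID pulse:  {duration_ms(NID_BITS):.2f} ms")
    print(f"full frame: {duration_ms(FRAME_BITS):.0f} ms")
```

Under these assumptions the pulse comes out somewhat under 7 ms, the same order of magnitude as the "about 1/100th of a second" quoted above.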


To efficiently jam particular frame subfields, a jam- 
mer must synchronize its transmissions so that it begins 
transmitting at or just before the first symbol of the
targeted field is sent by the transmitter under attack, and
ends just after the last symbol of the field has been sent. At
4800 symbols per second, each symbol lasts just longer 
than 0.2ms. This may seem at first to require an impos- 
sibly high degree of timing synchronization. But the P25 
framing scheme actually makes it quite straightforward 
for a jammer equipped with its own receiver to tightly 
synchronize to the target transmitter. Recall that each 
frame begins with an easily-recognized frame synchro- 
nization word, which the jammer can use to precisely 
trigger its interference so that it begins and ends at ex- 
actly the desired symbols. 

By careful synchronization, a jammer that attacks only 
the NID subfield of voice traffic can reduce its overall 
energy output so that it effectively has more than 14dB of 
average power advantage over the legitimate transmitter. 
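
The 14dB figure is consistent with the duty cycle alone. As a minimal sketch, assume (our simplification) that the jammer, while keyed, delivers the same power to the receiver as the legitimate transmitter; its average power is then lower by the duty cycle.

```python
import math

FRAME_BITS = 1728   # voice frame length, from the text
NID_BITS = 64       # jammed NID subfield, from the text


def average_power_advantage_db(duty_cycle):
    """Average power saving, in dB, of a jammer that is keyed only for
    the given fraction of the time (equal peak power is assumed)."""
    return -10 * math.log10(duty_cycle)


if __name__ == "__main__":
    advantage = average_power_advantage_db(NID_BITS / FRAME_BITS)
    print(f"average power advantage: {advantage:.1f} dB")
```

Jamming fewer than 64 bits per frame, as discussed next, would raise this figure further.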

It may be possible to improve the advantage to the 
jammer even more by careful analysis of the error correc- 
tion codes used in particular subfields in order to reduce 
the number of bits in the subfield that have to be jammed. 
(We assumed conservatively above that the attacker must 
jam every bit of the 64 bit NID field in order to prevent 
correct reconstruction of at least one bit of the NID pay- 
load, which clearly can be improved upon). This would 
permit even lower transmission times and average emit- 
ted power. It is not necessary to fully obliterate a critical 
protocol, merely to reliably (though not necessarily per- 
fectly) prevent its correct interpretation. 

Properly synchronized, a P25 jamming system can op- 
erate at a very low duty cycle that not only saves energy 
at the jammer and makes its equipment smaller and less 
expensive, but also makes the existence of the attack difficult
to diagnose and detect and, if detected, requires the
use of specialized equipment to locate. (Note that each
jamming transmission is only about 10ms
long, which is far shorter than the “oracle” transmissions
discussed in Section 3.3.) Such a jamming system can
be relatively inexpensive, requires only a modest
power supply, and is trivial to deploy in a portable configuration
that carries little risk to the attacker, as described
below.


We note that there is no analogous low-duty cycle jam- 
ming attack possible against the narrowband FM voice 
systems that P25 replaces. 


4.3 Selective Jamming Attacks 


An attacker need not attempt to jam every transmitted 
frame. The attacker can pick and choose which frames to 
attack in order to encourage the legitimate users to alter 
their behavior in particular ways. 


For example, it is straightforward to monitor for a non- 
zero MI field in a header frame (indicating an encrypted 
transmission) and to selectively jam portions of subse- 
quent frames, while leaving clear transmissions alone, in 
order to create the impression to the users of a radio net- 
work that, for unknown technical reasons, encryption has 
malfunctioned while clear transmission remains viable, 
thus inducing the users to downgrade to clear transmis- 
sions. If the users are already conditioned (through other 
weaknesses in P25) to unreliable cryptography, such an 
attack might be dismissed as routine. As we discuss in 
Section 5, it appears to be reasonable to expect that many 
such users are so conditioned. 
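
The clear-versus-encrypted test mentioned above is simple enough to state in a few lines of Python. The same check serves a passive monitor that tallies how much traffic on a channel is protected. This is a sketch only: it assumes the MI field has already been extracted from a decoded header frame, and the 9-byte (72-bit) field length is our assumption rather than something stated in the text.

```python
MI_LENGTH_BYTES = 9  # assumption: 72-bit Message Indicator field


def is_encrypted(mi: bytes) -> bool:
    """Classify a transmission from its header's MI field.

    Following the text, a non-zero MI indicates an encrypted
    transmission and an all-zero MI indicates cleartext.
    """
    if len(mi) != MI_LENGTH_BYTES:
        raise ValueError(f"expected {MI_LENGTH_BYTES} bytes, got {len(mi)}")
    return any(mi)


if __name__ == "__main__":
    clear_mi = bytes(MI_LENGTH_BYTES)              # all zeros
    example_mi = bytes([0x12]) + bytes(MI_LENGTH_BYTES - 1)  # made-up value
    print(is_encrypted(clear_mi), is_encrypted(example_mi))
```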


As another possibility, an attacker could choose to at- 
tack only uplink messages on the control channel of a 
trunked P25 system, thus effectively denying use of the 
entire trunked network at an extremely low cost to the 
attacker. 


In addition to the complexities of detecting and 
direction-finding an attack lasting mere hundredths or 
even thousandths of a second, adversaries can take steps 
to render their attacks less vulnerable to detection and 
more difficult for the operators of a radio network to 
prevent. For example, an attacker could choose to de- 
ploy multiple battery operated jamming devices in a 
metropolitan area, placing them in public locations to 
make tracing of the devices harder, or even surrepti- 
tiously attaching them to the vehicles of third parties 
such as taxis or delivery trucks to cause confusion, and 
to make the jammers harder to locate. Such devices may 
be made arbitrarily programmable, changing which of a 
group of devices is active at any one time or even taking 
commands over the air. 

Figure 7: GirlTech IMME, with modified firmware


4.4 Experimental Results 


To confirm that low duty cycle subframe jamming is 
effective against standard P25 receiver implementations 
and to examine practical jammer architectures that might 
be employed by an adversary, we implemented a low- 
power subframe jammer for P25 traffic for testing in our 
laboratory environment. 

Recent work has shown that inexpensive software pro- 
grammable radios such as the Ettus USRP are capable of 
implementing the P25 protocols and acting as part of a 
P25 deployment [7]. Their versatility and the availability
of open-source P25 software make them attractive
for reception, but round-trip delays between the receiver 
and transmitter make the platform less than ideal for sub- 
frame jamming. 

Instead, we implemented our proof-of-concept selec- 
tive jammer for P25 frames using the Texas Instru- 
ments CC1110 platform. The CC1110 chip combines 
a CC1101 radio with an 8051 microcontroller in a sin- 
gle system-on-chip package, allowing for faster reaction 
times than a USRP or other software radio could sup- 
port. When jamming reflexively, packets are passed to 
the 8051 one byte at a time, allowing a filter to selectively 
jam transmissions only if the received header matches an 
intended target. 

While any CC1110 board for the correct frequency 
range is sufficient, we used the GirlTech IMME, a com- 
mercial toy intended for pre-teen children to text mes- 
sage one another without cellular service. Presently 
priced at $30 USD, the package includes a handheld unit 
and a USB adapter, either of which may be used with our 
P25 client (for an aggregate price of $15 per jammer). 

In order to facilitate rapid development, our CC1110 
toolkit for P25 was divided into a Python-language client 
that communicates with native 8051 applications through 
an open-source debugger, the GoodFET [8]. Operations
which do not require a fast reaction time are implemented
only in Python, while timing-critical operations
such as packet reception and sub-frame jamming are implemented
as small fragments of C applications and are
executed from RAM in the CC1110. Once a particular
program has been verified to behave correctly, it can be
rewritten as a stand-alone application to run from flash
memory under battery power.

Figure 8: Sub-Frame Reflex Jamming

As shown in Figure 8, our sub-frame jammer is trig- 
gered by the LDU Frame Sync bitstream. Upon receiving 
this sequence, the CC1101 switches from its Receive to 
Transmit states. Starting the transition before the last 8 
symbols of the 24-symbol Frame Sync are received al- 
lows the jammer-induced packet errors to begin from the 
very first byte of an LDU’s NID field. Holding the trans- 
mission for the entire duration of the NID subframe and 
then ending it immediately produces an overall duty cy- 
cle of 3.7% relative to the transmitter under attack. 

Our lab experiments were entirely successful. The 
GirlTech-based reflexive subframe jammer is able to re- 
liably prevent reception from a nearby Motorola P25 
transmitter as received by both a Motorola XTS2500 
transceiver and Icom PCR-2500, with the jammer and the 
transmitter under attack both operating at similar power 
levels and with similar distance from the receiver. A 
standard off-the-shelf external RF amplifier would be all 
that is necessary to extend this experimental apparatus to 
real-world, long-range use. While we did not perform 
high power or long-range jamming ourselves (and there 
are significant regulatory barriers to such experiments), 
we expect that an attacker would face few technical dif- 
ficulties scaling a jammer within the signal range of a 
typical metropolitan area. 


5 Encryption Failure in Fielded Systems 


Even if the P25 protocols and the design of P25 products 
might make them potentially vulnerable to user and con- 
figuration error, that does not automatically mean that 
fielded P25 systems are always insecure in practice. A 
natural question, then, is how successful the users of se- 
cure P25 radio systems are in preventing the unintended 
transmission of sensitive cleartext. 

One way to answer this question might be through
a usability study, such as the one seminally performed by 
Whitten and Tygar with PGP [19], in which researchers 
train test subjects to configure and use a P25 system and 
then observe their behavior and performance in a con- 
trolled environment. While such studies can have value 
in evaluating, e.g., different user interface designs from 
among a set of candidates, they have inherent limita- 
tions. Aside from the cost of recruiting and observing 
suitable test subjects, it can be difficult to replicate “real 
world” conditions — especially the motivation of the users 
to maintain security while getting their work done — suf- 
ficiently well to ensure that the results are representative 
of the system’s true usability under field conditions [3]. 

Instead, we measured and analyzed the incidence of 
unintended cleartext leakage in real P25 systems car- 
rying a high volume of sensitive encrypted traffic with 
trained and motivated users: the secure tactical two-way 
radio systems used in federal criminal investigations. 


5.1 Over-the-Air Analysis


Although P25 is designed for general two-way radio use, 
the principal users of P25 in the US are law enforcement 
and public safety agencies. P25 has recently enjoyed par- 
ticularly widespread adoption by the federal government 
for the tactical radios used for surveillance and other con- 
fidential operations by Federal law enforcement agencies 
such as the DEA, FBI, the Secret Service, ICE, and so on. 

Most of the P25 tactical radio systems currently used 
by these agencies operate in one of two frequency bands 
in the VHF and UHF radio spectrum allocated exclu- 
sively for Federal use. There are approximately 2000 
two-way radio voice channels in the Federal spectrum 
allocation (comprising 11 MHz in the VHF band plus 
14 MHz in the UHF band, with channels spaced every 
12.5 kHz). Most of these channels are unused in any
given geographic area. The individual channels used by
each given agency are assigned on a region-by-region basis,
so a channel used by, say, the National Parks Service
in one area might be used by the Bureau of Prisons
in another area. Channels used for sensitive tactical
law enforcement operations are mixed in among those of
other Federal agencies and likewise vary on a regional
basis. All Federal channel allocations are managed by
the National Telecommunications and Information Administration
and, unlike the state, local, and private frequency
allocations managed by the Federal Communications
Commission, are not published.
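
The figure of approximately 2000 channels follows directly from the band sizes and channel spacing given above, as this short Python check shows. It assumes, for simplicity, that each band is divided evenly with no guard bands.

```python
VHF_BANDWIDTH_HZ = 11_000_000   # 11 MHz in the VHF band, from the text
UHF_BANDWIDTH_HZ = 14_000_000   # 14 MHz in the UHF band, from the text
CHANNEL_SPACING_HZ = 12_500     # 12.5 kHz channel spacing, from the text


def channel_count(bandwidth_hz, spacing_hz=CHANNEL_SPACING_HZ):
    """Number of whole channels that fit in a band (no guard bands assumed)."""
    return bandwidth_hz // spacing_hz


if __name__ == "__main__":
    vhf = channel_count(VHF_BANDWIDTH_HZ)
    uhf = channel_count(UHF_BANDWIDTH_HZ)
    print(vhf, uhf, vhf + uhf)
```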


⁸ Although the Federal agency frequency assignments are not officially
published by the government, some of the tactical frequencies
used by some agencies in some areas are relatively well known and can
be found on the Internet. But most of the frequencies used for sensitive
tactical communication are not published or widely known.

We built a P25 traffic interception system for the Federal
frequency bands, which we operated over a two-year
period in two US metropolitan areas. Our system consists
of an array of Icom PCR-2500 software-controlled
radio receivers [11], an inexpensive ($1000) wide-band
receiver marketed to radio hobbyists and also popular in
commercial monitoring applications. The PCR-2500 has
several features that were important to us: relatively good
performance in the federal VHF and UHF frequency
bands, software programmability (via a USB interface),
P25 capability via a daughterboard option, and the ability
to search a range of frequencies to identify those in
active use.

Our first task was to identify and catalog the particular
frequencies used for sensitive tactical operations in
each of our two metropolitan areas. We programmed
PCR-2500 receivers located at two locations in or near
each city to identify frequencies with P25 signals being
transmitted in the federal frequency bands. We live monitored
traffic on each identified frequency to determine
whether it was used for law enforcement surveillance or
other sensitive operations. After several months, we positively
identified 114 frequencies in one city and 109 in
the other as being used for sensitive law enforcement operations.
While some of the frequencies we found carried
a great deal of traffic, many others were only used sporadically.
On every one of the sensitive frequencies we
found, the traffic was predominantly encrypted, but still
carried at least occasional cleartext. We could, of course,
only monitor the transmissions that were sent in the clear
(which extended the time required for our frequency cataloging
process).⁹

⁹ It is explicitly legal under 18 USC 2511 for any person in the US to
intercept and monitor unencrypted law enforcement radio traffic, even
sensitive communication that perhaps should be encrypted. However,
in the interest of public safety, we decline to identify here the particular
frequencies used by particular agencies. Also, to comply with our institutional
IRB requirements, we did not retain and will not disclose here
any personally identifiable information we happened to monitor or derive,
whether about surveillance targets or the government employees
who were using the radios.

We then set up infrastructure to intercept every cleartext
transmission that occurred on the sensitive frequencies
we identified. We dedicated a number of individual
PCR-2500 receivers to intercept traffic on a few particularly
active frequencies, in order to ensure that we would
capture virtually all of the cleartext that was transmitted
on them. (The frequencies with dedicated receivers were
the output channels of nearby repeater systems, which
had the desirable effect of ensuring that any transmissions
we did not record were missed not because our receiver
was out of geographic range but because the traffic
was encrypted). For the remaining frequencies, we
used two additional PCR-2500 receivers in different locations
around each city to continuously “scan” through
the channels and capture traffic detected during the scan
(Icom supplies software that performs a similar function,
but it did not have sufficient capability to record
the P25 metadata we were concerned with, so we had to
write our own software for this purpose). We operated
this arrangement, on an increasing number of discovered
frequencies and with an increasing number of receivers,
over a period of two years.

We “live sampled” cleartext audio each day. We disre- 
garded “non-sensitive” traffic such as radio tests or other 
messages for which encryption would be unnecessary or 
inappropriate (this represented only a small fraction of the
traffic on the frequencies we were monitoring), leaving 
only “unintended” sensitive cleartext. We categorized 
each unintended cleartext message exchange according 
to the apparent error made or other reason it was sent in 
the clear. (We did not retain any identifying information 
about agents or targets). 

In every case, sensitive traffic we sampled was sent in 
the clear under one of three scenarios: 


• Individual Error: One or more users in the clear,
but other users encrypted. In this scenario, all users 
clearly shared a common cryptographic key, since 
communication was able to occur unimpeded. But 
the users transmitting in the clear apparently ac- 
cidentally switched their radios to transmit in the 
clear mode. Because the offending users still re- 
ceived the other users’ encrypted traffic and because 
those users had no way to reliably tell that they were 
sometimes getting clear traffic, this situation typi- 
cally remained undetected. 


• Group Error: All users operated in the clear, but
gave an indication that they believed they were op-
erating in encrypted mode. In some cases, this in-
volved one user explaining to another how to set
the radio to encrypted mode, but actually describing
the procedure for setting it to clear mode. In other
cases, the users would simply announce that they
had just rekeyed their radios to operate in en-
crypted mode (but were actually in the clear).


• Keying Failure: One or more users did not have the
correct key, were unable to receive encrypted transmis-
sions, and asked (in the clear) that everyone switch to
clear mode for the duration of an operation so that
all group members were able to participate.

Across all agencies, the unintended cleartext we inter-
cepted was roughly evenly split among the Individual 
Error, Group Error, and Keying Failure categories. In 
general, we found that even when users knew they were 
operating in the clear (because they expressly indicated 
that they were switching to clear mode due to keying fail- 
ure) and were engaged in sensitive operations, they made 
little effort to conceal the nature of their activity in their 
transmissions, and often appeared to “forget” that they 
were operating in the clear. 

Note that every system we monitored had P25 encryp- 
tion capability, and, indeed, most of the traffic sent was 
apparently successfully encrypted most of the time. Yet 
we still intercepted hundreds of hours of very sensitive 
traffic that was sent in the clear over the course of two 
years. While we will not identify here the agencies, lo- 
cations, or particular operations involved, we note that 
the traffic we monitored routinely disclosed some of the 
most sensitive law enforcement information that the gov- 
ernment holds, including: 


• Names and locations of criminal investigative tar-
gets, including those involved in organized crime.


• Names and other identifying features of confidential
informants.


• Descriptions and other characterizing features of
undercover agents.


• Locations and descriptions of surveillance operatives
and their vehicles.


• Details about surveillance infrastructure being em-
ployed against particular targets (hidden cameras,
aircraft, etc.).


• Information relayed by Title III wiretap plants.


• Plans for forthcoming arrests, raids and other confi-
dential operations. 


During March, April and May 2011, we intercepted 
a mean of 23 minutes of unintended sensitive cleartext 
per day per city across all monitored frequencies. Note 
that the variance was high; on some days, particularly 
weekends and holidays, we would capture less than one 
minute, while on others, we captured several hours. We 
monitored sensitive transmissions about operations by 
agents in every Federal law enforcement agency in the 
Department of Justice and the Department of Homeland 
Security. Most traffic was apparently related to crimi- 
nal law enforcement, but some of the traffic was clearly
related to other sensitive operations, including counter-
terrorism investigations and executive protection of high
ranking officials.¹⁰


6 End-User Stopgap Mitigations 


Many of the security problems in P25 arise from basic 
protocol design and architectural decisions that cannot 
be altered without a substantial, top-to-bottom redesign 
of the protocols and of the assumptions under which they
operate. Given the critical and highly sensitive nature
of much of the P25 user base, we strongly urge that a 
high priority be placed on such a redesign. However, 
until that occurs, there is little that the P25 user can do to 
defend against, e.g., the denial of service weaknesses we 
identified. 

Other vulnerabilities arise from implementation errors 
or poor choices made by individual vendors (such as the 
transmission of unit IDs in the clear). These can be fixed 
without a redesign, but again, P25 users can do little to 
defend themselves here except to wait for the vendors to 
address these errors and deficiencies. 

However, we note that there may be two areas in which 
P25 users and system administrators can immediately 
reduce the incidence of unintended sensitive cleartext 
transmission: improving the configuration of radio user
interfaces and re-thinking their rekeying policies. 

At least half of the unintended cleartext we captured 
was attributable to some form of “user error”. However,
it would be a mistake to simply dismiss this as careless- 
ness or to focus entirely on user awareness and training. 
In fact, these “user” errors are effectively invited by the 
radio user interfaces, and it is these interfaces to which 
we should assign the blame. But, fortunately, many cur- 
rent P25 radios can be “customer configured” by the end- 
user’s system manager to make the security state clearer 
to the user. 

In particular, we suggest that the radios be configured 
without the use of the “secure” switch. Instead, encryp- 
tion should be configured (“strapped”) to be always on 
(or always off) for each channel. Displayed channel 
names should be chosen to reflect whether encryption is 
strapped on or off, e.g., channel “TAC1” might be re-
named instead to “TAC1 Secure” or “TAC1 Clear”. (If
both secure and clear capability are required on the same 
frequency, the channel assignment can be duplicated). 
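
As a rough illustration of the strapping scheme suggested above, the following Python sketch builds a channel list in which every entry has a fixed encryption state and a display name that reflects it. The `Channel` type, its fields, and the frequency value are hypothetical; they do not correspond to any vendor's radio programming software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Channel:
    """Hypothetical channel entry; not a real radio codeplug format."""
    name: str             # text shown on the radio display
    frequency_mhz: float  # illustrative value only
    encrypted: bool       # strapped state; the user cannot toggle it


def strapped_pair(base_name: str, frequency_mhz: float) -> list[Channel]:
    """Duplicate one frequency into a secure and a clear channel entry."""
    return [
        Channel(f"{base_name} Secure", frequency_mhz, encrypted=True),
        Channel(f"{base_name} Clear", frequency_mhz, encrypted=False),
    ]


channels = strapped_pair("TAC1", 165.0)  # 165.0 MHz is a made-up frequency
for ch in channels:
    print(ch.name, "encrypted" if ch.encrypted else "clear")
```

Because the security state is part of the channel name, selecting a channel is the only action that changes it, and the display always states which mode is in use.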


¹⁰We are currently working with the agencies we monitored to help
them improve their radio security practices. However, because many of 
the weaknesses that lead to cleartext leakage result from basic proper- 
ties of the protocols and their implementations, incidents of unintended 
cleartext are likely to continue to occur from time to time even with in- 
creased user vigilance.

The second major cause of unintended cleartext that
we captured arose from users who did not have current 
keys, often due to key expiration and the failure of the 
OTAR protocol. Some systems rekey weekly or monthly, 
and we found that users are inevitably left without cur- 
rent key material as a result. 

We suggest that systems be configured to greatly min- 
imize the required frequency of rekeying and to main- 
tain keys for much longer than they are under current 
practice. Instead of monthly rekeys, systems should de- 
ploy long-lived, non-volatile keys that are changed only 
at very long intervals or if an actual compromise (such as 
a lost radio) is discovered. This will greatly improve the 
likelihood that users who wish to communicate securely 
will share common key material when they need it. 


7 Conclusions 


APCO P25 is a widely deployed protocol aimed at crit- 
ical public safety, law enforcement, and national secu- 
rity applications. The user base for secure P25 is rapidly 
growing in the United States and other countries, espe- 
cially among federal law enforcement and intelligence 
agencies that conduct surveillance and other covert ac- 
tivities against sophisticated adversaries. 

As a wireless system, P25 is inherently vulnerable to 
passive traffic interception and active attack, and so it 
must rely entirely on cryptographic techniques for its op- 
tional security features. And yet we found the protocol
and its implementations suffer from serious weaknesses
that leak sensitive data, invite inadvertent clear transmis- 
sion in “secure” mode, and permit active and passive 
tracking and traffic analysis. The protocol is difficult to 
use properly even when not under attack, as evidenced 
by our interception of large volumes of sensitive cleart- 
ext sent by mistake. 

The protocol is particularly vulnerable to denial of ser- 
vice. Perhaps uniquely among modern digital voice ra- 
dio systems, P25 can be effectively jammed with only 
a fraction of the aggregate signal power used by the le- 
gitimate user, by attackers with low cost equipment and 
without access to secrets such as keys or user-specific 
codes. Jamming attacks can also be used to aid in the 
exploitation of other weaknesses, such as selectively dis- 
abling security features to force users into the clear. 

It is reasonable to wonder why this protocol, which 
was developed over many years and is used for sensi- 
tive and critical applications, is so difficult to use and 
so vulnerable to attack. We might compare P25 with 
other voice encryption protocols and systems, such as the 
US Government’s STU-III and STE [18] encrypting tele-
phone systems used for classified traffic, that perform an
ostensibly similar function and yet do not appear to suf- 
fer from such a large number of exploitable deficiencies. 
However, we note that P25 is based on a very different 
model from that of most cryptographic communication 
protocols. In the vast majority of cryptographic proto- 
cols, both sender and receiver are active participants in 
the protocol, and perform a negotiation or handshake be- 
fore communication proceeds. In such protocols, both 
parties typically have the opportunity to discover and re- 
cover from errors, or abort the transaction, before any 
data is transmitted. P25, however, while used in “two- 
way” radio systems, is essentially a unilateral broadcast 
system. All cryptographic decisions are made entirely 
by the sender, with the receiver only a passive recipi- 
ent of whatever the sender has transmitted. Protocols for 
such broadcast-based encryption have not been as widely 
formally studied as other forms of secure communica- 
tion (with the possible exception of encryption in direct- 
broadcast television systems), and may represent a rich 
and difficult class of problem worthy of more attention 
by our community. We explore this in more detail in ref- 
erence [4].


Abstract


During the past few years, a vast number of online file
storage services have been introduced. While several of
these services provide basic functionality such as upload-
ing and retrieving files by a specific user, more advanced
services offer features such as shared folders, real-time
collaboration, minimization of data transfers or unlim-
ited storage space. Within this paper we give an overview
of existing file storage services and examine Dropbox,
an advanced file storage solution, in depth. We analyze
the Dropbox client software as well as its transmission
protocol, show weaknesses and outline possible attack
vectors against users. Based on our results we show that
Dropbox is used to store copyright-protected files from
a popular filesharing network. Furthermore, Dropbox can
be exploited to hide files in the cloud with unlimited stor-
age capacity. We define this as online slack space. We
conclude by discussing security improvements for mod-
ern online storage services in general, and Dropbox in
particular. To prevent our attacks, cloud storage opera-
tors should employ data possession proofs on clients, a
technique which has been recently discussed only in the
context of assessing trust in cloud storage operators.


1 Introduction


Hosting files on the Internet to make them retrievable
from all over the world was one of the goals when the
Internet was designed. Many new services have been
introduced in recent years to host various types of files
on centralized servers or distributed on client machines.
Most of today’s online storage services follow a very
simple design and offer very basic features to their users.
From the technical point of view, most of these services
are based on existing protocols such as the well known
FTP [28], proprietary protocols or WebDAV [22], an ex-
tension to the HTTP protocol.

With the advent of cloud computing and the shared
usage of resources, these centralized storage services
have gained momentum in their usage, and the number
of users has increased heavily. In the special case of on- 
line cloud storage the shared resource can be disc space 
on the provider’s side, as well as network bandwidth 
on both the client’s and the provider’s side. An online 
storage operator can safely assume that, besides private 
files as well as encrypted files that are specific and 
different for every user, a lot of files such as setup files 
or common media data are stored and used by more than 
one user. The operator can thus avoid storing multiple 
physical copies of the same file (apart from redundancy 
and backups, of course). To the best of our knowledge, 
Dropbox is the biggest online storage service so far 
that implements such methods for avoiding unnecessary 
traffic and storage, with millions of users and billions 
of files [24]. From a security perspective, however, the 
shared usage of the users’ data raises new challenges.
The clear separation of user data cannot be maintained 
to the same extent as with classic file hosting, and 
other methods have to be implemented to ensure that 
within the pool of shared data only authorized access 
is possible. We consider this to be the most important 
challenge for efficient and secure “cloud-based” storage 
services. However, not much work has been previously 
done in this area to prevent unauthorized data access or 
information leakage. 


We focus our work on Dropbox because it is the 
biggest cloud storage provider that implements shared 
file storage on a large scale. New services will offer sim- 
ilar features with cost and time savings on both the client 
and the operator’s side, which means that our findings are
of importance for all upcoming cloud storage services as
well. Our proposed measures to prevent unautho-
rized data access and information leakage, demonstrated
here using Dropbox as an example, are not specific to Dropbox
and should be used for other online storage services as
well. We believe that the number of cloud-based storage
operators will increase heavily in the near future.

Our contribution in this paper is to:


• Document the functionality of an advanced cloud
storage service with server-side data deduplication
such as Dropbox.


• Show under what circumstances unauthorized ac-
cess to files stored within Dropbox is possible.


• Assess if Dropbox is used to store copyright-
protected material.


• Define online slack space and the unique problems
it creates for the process of a forensic examination.


• Explain countermeasures, both on the client and the
server side, to mitigate the resulting risks from our
attacks for user data.


The remainder of this paper is organized as follows. 
Related work and the technical details of Dropbox are 
presented in Section 2. In Section 3 we introduce an at- 
tack on files stored at Dropbox, leading to information 
leakage and unauthorized file access. Section 4 discusses 
how Dropbox can be exploited by an adversary in var- 
ious other ways while Section 5 evaluates the feasibil- 
ity of these attacks. We conclude by proposing various 
techniques to reduce the attack surface for online storage 
providers in Section 6. 


2 Background 


This section describes the technical details and imple- 
mented security controls of Dropbox, a popular cloud 
storage service. Most of the functionality is attributable
to the new cloud paradigm and is not specific to Dropbox.
In this paper we use the notion of cloud computing as de- 
fined in [9], meaning applications that are accessed over 
the Internet with the hardware running in a data center 
not necessarily under the control of the user: 


“Cloud Computing refers to both the applica- 
tions delivered as services over the Internet and 
the hardware and systems software in the data 
centers that provide those services.” ... “The 
datacenter hardware and software is what we 
will call a Cloud.” 


In the following we describe Dropbox and related litera- 
ture on cloud storage. 


2.1 Dropbox 


Since its initial release in September 2008, Dropbox
has become one of the most popular cloud storage
providers on the Internet. It has 10 million users and
stores more than 100 billion files as of May 2011 [2]
and saves 1 million files every 5 minutes [3]. Dropbox
is mainly an online storage service that can be used 
to create online backups of files, and one has access 
to files from any computer or similar device that is 
connected to the Internet. Desktop client software
available for different operating systems keeps all the
data in a specified directory in sync with the servers, and
synchronizes changes automatically among different
client computers of the same user. Subfolders can be
shared with other Dropbox users, and changes in shared 
folders are synced and pushed to every Dropbox account 
that has been given access to that shared folder. Large 
parts of the Dropbox client are written in Python. 


Internally, Dropbox does not use the concept of files, 
but every file is split up into chunks of up to 4 megabytes 
in size. When a user adds a file to his local Dropbox 
folder, the Dropbox client application calculates the hash 
values of all the chunks of the file using the SHA-256 
algorithm [19]. The hash values are then sent to the 
server and compared to the hashes already stored on 
the Dropbox servers. If a file does not exist in their 
database, the client is requested to upload the chunks. 
Otherwise the corresponding chunk is not sent to the 
server because a copy is already stored. The existing file 
on the server is instead linked to the Dropbox account. 
This approach allows Dropbox to save traffic and storage 
costs, and users benefit from a faster syncing process 
if files are already stored on the Dropbox servers. The 
software uses numerous techniques to further enhance 
efficiency, e.g., delta encoding, to only transfer those
parts of the files that have been modified since the
last synchronization with the server. If by any chance
two distinct files should have the same hash value, the
user would be able to access other users’ content since
the file stored on the servers is simply linked to the
user’s Dropbox account. However, the probability of a
coincidental collision in SHA-256 is negligibly small. 
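
The chunking and deduplication steps described above can be sketched as follows. This is a minimal illustration using Python's standard `hashlib`, not the Dropbox client's actual code; the function names, the hex encoding of the digests, and the interpretation of "4 megabytes" as 4 × 1024 × 1024 bytes are assumptions.

```python
import hashlib
import io
from typing import BinaryIO, Iterator

CHUNK_SIZE = 4 * 1024 * 1024  # assumed: "4 megabytes" taken as 4 MiB


def chunk_hashes(stream: BinaryIO, chunk_size: int = CHUNK_SIZE) -> Iterator[str]:
    """Yield the SHA-256 hex digest of each successive chunk of the stream."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        yield hashlib.sha256(chunk).hexdigest()


def chunks_to_upload(hashes: list[str], known_on_server: set[str]) -> list[str]:
    """Return the hashes the server does not hold yet (hypothetical helper)."""
    return [h for h in hashes if h not in known_on_server]


# Tiny demonstration with a 10-byte chunk size instead of 4 MiB.
data = io.BytesIO(b"a" * 10 + b"b" * 5)
hashes = list(chunk_hashes(data, chunk_size=10))
missing = chunks_to_upload(hashes, known_on_server={hashes[0]})
print(len(hashes), len(missing))
```

Only the chunks whose hashes are reported as missing would have to be transferred; all others are linked to the account without any upload.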


The connections between the clients and the Drop- 
box servers are secured with SSL. Uploaded data is 
encrypted with AES-256 and stored on Amazon’s S3
storage service that is part of the Amazon Web Services 
(AWS) [1]. The AES key is user independent and only 
secures the data during storage at Amazon S3, while 
transfer security relies on SSL. Our research on the 
transmission protocol showed that data is directly sent 
to Amazon EC2 servers. Therefore, encryption has to 
be done by EC2 services. We do not know where the 
keys are stored and if different keys are used for each 
file chunk. However, the fact that encryption and storage
are done at the same place seems questionable to us, as
Amazon is most likely able to access decryption keys¹.


After uploading the chunks that were not yet in the 
Dropbox storage system, Dropbox calculates the hash 
values on their servers to validate the correct transmis- 
sion of the file, and compares the values with the hash 
values sent by the client. If the hash values do not match, 
the upload process of the corresponding chunk is re- 
peated. The drawback of this approach is that the server 
can only calculate the hash values of actually uploaded 
chunks; it is not able to validate the hash values of files 
that were already on Dropbox and that were provided by 
the client. Instead, it trusts the client software and links 
the chunk on the server to the Dropbox account. There- 
fore, spoofing the hash value of a chunk added to the 
local Dropbox folder allows a malicious user to access 
files of other Dropbox users, given that the SHA-256 
hash values of the file’s chunks are known to the attacker. 
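
To make the trust problem concrete, the following toy model mimics the server behavior described above: uploaded chunks are verified against the claimed hash, but an already known hash is linked to the requesting account without any proof of possession. The class and method names are hypothetical, and the model is an in-memory sketch rather than a description of Dropbox's real server code.

```python
import hashlib


class ToyDedupServer:
    """In-memory model of client-hash-based deduplication (illustrative only)."""

    def __init__(self) -> None:
        self.chunks: dict[str, bytes] = {}    # hash -> stored chunk
        self.links: dict[str, set[str]] = {}  # account -> linked hashes

    def announce(self, account: str, claimed_hash: str) -> bool:
        """Client announces a hash; return True if an upload is required."""
        if claimed_hash in self.chunks:
            # Known hash: linked on the client's word alone, nothing is verified.
            self.links.setdefault(account, set()).add(claimed_hash)
            return False
        return True

    def upload(self, account: str, claimed_hash: str, data: bytes) -> bool:
        """Store an uploaded chunk only if it matches the claimed hash."""
        if hashlib.sha256(data).hexdigest() != claimed_hash:
            return False  # mismatch: the client would be asked to retry
        self.chunks[claimed_hash] = data
        self.links.setdefault(account, set()).add(claimed_hash)
        return True

    def download(self, account: str, chunk_hash: str) -> bytes:
        """Return a chunk that is linked to the given account."""
        if chunk_hash not in self.links.get(account, set()):
            raise PermissionError("chunk not linked to this account")
        return self.chunks[chunk_hash]


server = ToyDedupServer()
secret = b"alice's private chunk"
h = hashlib.sha256(secret).hexdigest()
server.announce("alice", h)       # unknown hash, so an upload is requested
server.upload("alice", h, secret)

# A second account that knows only the hash is linked without any check.
needs_upload = server.announce("mallory", h)
leaked = server.download("mallory", h)
print(needs_upload, leaked == secret)
```

In this model the verification in `upload` never runs for a hash that is already stored, which is exactly the gap the paragraph above describes.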


Due to the recent buzz in cloud computing many com- 
panies compete in the area of cloud storage. Major op- 
erating system companies have introduced their services 
with integration into their system, while small startups 
can compete by offering cross-OS functionality or more 
advanced security features. Table 1 compares a selec-
tion of popular file storage providers without any claim
of completeness. Note that “encrypted storage” means
that the file is encrypted locally before it is sent to the
cloud storage provider and “shared storage” means that it
is possible to share files and folders between users.


2.2 Related Work 


Related work on secure cloud storage focuses mainly 
on determining if the cloud storage operator is still in 
possession of the client’s file, and if it has been modified. 
An interesting survey on the security issues of cloud 
computing in general can be found in [30]. A summary 
of attacks and new security problems that arise with the 
usage of cloud computing has been discussed in [17]. 
In a paper by Shacham et al. [11] it was demonstrated 
that it is rather easy to map the internal infrastructure of 
a cloud storage operator. Furthermore they introduced 
co-location attacks where they have been able to place 
a virtual machine under their control on the same 
hardware as a target system, resulting in information 
leakage and possible side-channel attacks on a virtual 
machine. 


¹Independently found and confirmed by Christopher Soghoian [5]
and Ben Adida [4].

Early publications on file retrievability [25, 14] check
if a file can be retrieved from an untrusted third party 
without retransmitting the whole file. Various papers 
propose more advanced protocols [11, 12, 20] to ensure 
that an untrusted server has the original file without 
retrieving the entire file, while maintaining an overall 
overhead of O(1). Extensions have been published
that allow checking of dynamic data; for example,
Wang et al. [32] use a Merkle hash tree which allows
a third party auditor to audit for malicious providers 
while allowing public verifiability as well as dynamic 
data operations. The use of algebraic signatures was 
proposed in [29], while a similar approach based on ho- 
momorphic tokens has been proposed in [31]. Another 
cryptographic tree structure is named “Cryptree” [23] 
and is part of the Wuala online storage system. It 
allows strong authentication by using encryption and 
can be used for P2P networks as well as untrusted 
cloud storage. The HAIL system proposed in [13] 
can be seen as an implementation of a service-oriented 
version of RAID across multiple cloud storage operators. 


Harnik et al. describe similar attacks in a recent pa- 
per [24] on cloud storage services which use server-side 
data deduplication. They recommend using encryption 
to stop server-side data deduplication, and propose a ran- 
domized threshold in environments where encryption is 
undesirable. However, they do not employ client-side 
data possession proofs to prevent hash manipulation at- 
tacks, and have no practical evaluation for their attacks. 


3 Unauthorized File Access 


In this section we introduce three different attacks on 
Dropbox that enable access to arbitrary files given 
that the hash values of the file, or rather of its
chunks, are known. If an arbitrary cloud storage service
relies on the client for hash calculation in server-side 
data deduplication implementations, these attacks are 
applicable as well. 


3.1 Hash Value Manipulation Attack 


For the calculation of SHA-256 hash values, Drop- 
box does not use the hashlib library which is part 
of Python. Instead it delegates the calculation to 
OpenSSL [18] by including a wrapper library called 
NCrypto [6]. The Dropbox clients for Linux and Mac 
OS X dynamically link to libraries such as NCrypto 
and do not verify their integrity before using them. We 
modified the publicly available source code of NCrypto 
so that it replaces the hash value that was calculated by 
OpenSSL with our own value (see Figure 1), built it

Name                   Protocol     Encrypted transmission  Encrypted storage      Shared storage
Dropbox                proprietary  yes                     no                     yes
Box.net                proprietary  yes                     yes (enterprise only)  yes
Wuala                  Cryptree     yes                     yes                    yes
TeamDrive              many         yes                     yes                    yes
SpiderOak              proprietary  yes                     yes                    yes
Windows Live Skydrive  WebDAV       yes                     no                     yes
Apple iDisk            WebDAV       no                      no                     no
Ubuntu One             u1storage    yes                     no                     yes

Table 1: Online Storage Providers


and replaced the library that was shipped with Dropbox. 
The Dropbox client does not detect this modification
and, for any new file in the local Dropbox folder, transmits
the modified hash value to the server. If the transmitted
hash value does not exist in the server’s database, the
server requests the file from the client and tries to verify
the hash value after the transmission. Because of our
manipulation on the client side, the hash values will
not match and the server will detect the mismatch. The
server will then re-request the file to overcome an apparent
transmission error.
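
The attack relies on the client loading a replaced library without verifying its integrity. As a hedged sketch of what such a check could look like, the following Python code compares a library file's SHA-256 digest with a value pinned at build time. The function names and the pinned-digest approach are our own assumptions, not something the Dropbox client is known to implement.

```python
import hashlib
import hmac
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 64 KiB blocks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def library_is_untampered(path: Path, pinned_digest: str) -> bool:
    """Compare the library's digest with a digest pinned at build time."""
    return hmac.compare_digest(file_sha256(path), pinned_digest)
```

Such a check only raises the bar: an attacker who can replace the library on the local machine can usually also patch the code that performs the check, which is why the server should not rely on client-computed hashes in the first place.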









[Figure 1: Hash Value Manipulation Attack. Diagram labels: Dropbox-Client (Python); replacing hash value; SHA-256; OpenSSL (hash value calculation).]


However, if the hash value is already in the server’s 
databases the server trusts the hash value calculation of 
the client and does not request the file from the client. 
Instead it links the corresponding file/chunk to the 
Dropbox account. Due to the manipulation of the hash 
value we thus got unauthorized access to arbitrary files. 


This attack is completely undetectable to the user. If
the attacker already knows the hash values, he can down-
load files directly from the Dropbox server and no inter- 
action with the client is needed which could be logged or 
detected on the client side. The victim is unable to notice 
this in any way, as no access to his computer is required. 
Even for the Dropbox servers this unauthorized access to 
arbitrary files is not detectable because they believe the 
attacker already owns the files, and simply added them
to his local Dropbox folder.


3.2 Stolen Host ID Attack 


During setup of the Dropbox client application on a 
computer or smartphone, a unique host ID is created 
which links that specific device to the owner’s Dropbox 
account. The client software does not store username 
and password. Instead, the host ID is used for client 
and user authentication. It is a random-looking 128-bit
key that is calculated by the Dropbox server from
several seeding values provided by the client (e.g.,
username, exact date and time). The algorithm is not
publicly known. This linking requires the user’s account 
credentials. When the client on that host is success- 
fully linked, no further authentication is required for 
that host as long as the Dropbox software is not removed. 


If the host ID is stolen by an attacker, extracted by 
malware or by social engineering, all the files in that
user’s account can be downloaded by the attacker. He
simply replaces his own host ID with the stolen one, re- 
syncs Dropbox and consequently downloads every file. 


3.3 Direct Download Attack


Dropbox’s transmission protocol between the client 
software and the server is built on HTTPS. The client 
software can request file chunks from
https://dl-clientXX.dropbox.com/retrieve (where XX is replaced
by consecutive numbers) by submitting the SHA-256
hash value of the file chunk and a valid host ID as
HTTPS POST data. Surprisingly, the host ID doesn’t
even need to be linked to a Dropbox account that owns
the corresponding file. Any valid host ID can be used
to request a file chunk as long as the hash value of the 
chunk is known and the file is stored at Dropbox. As 
we will see later, Dropbox hardly deletes any data. It 
is even possible to just create an HTTPS request with 
any valid host ID, and the hash value of the chunk to 
be downloaded. This approach could be easily detected 
by Dropbox because a host ID that neither uploaded
the chunk nor is known to be in possession of it
would try to download it. By contrast the hash
manipulation attack described above is undetectable for 
the Dropbox server, and (minor) changes to the core 
communication protocol would be needed to detect it. 
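
The detection opportunity mentioned above can be sketched as a server-side ownership check: a retrieve request is allowed only if the account behind the host ID is already linked to the requested chunk, and anything else is flagged. The `RetrieveAuditor` class and its fields are hypothetical names for a toy model, not part of any real Dropbox interface.

```python
from dataclasses import dataclass, field


@dataclass
class RetrieveAuditor:
    """Toy server-side check for chunk requests (all names are hypothetical)."""
    host_to_account: dict[str, str] = field(default_factory=dict)
    account_chunks: dict[str, set[str]] = field(default_factory=dict)
    alerts: list[tuple[str, str]] = field(default_factory=list)

    def link(self, host_id: str, account: str, chunk_hash: str) -> None:
        """Record that this host's account is linked to the chunk."""
        self.host_to_account[host_id] = account
        self.account_chunks.setdefault(account, set()).add(chunk_hash)

    def may_retrieve(self, host_id: str, chunk_hash: str) -> bool:
        """Allow a request only if the host's account is linked to the chunk."""
        account = self.host_to_account.get(host_id)
        allowed = (
            account is not None
            and chunk_hash in self.account_chunks.get(account, set())
        )
        if not allowed:
            self.alerts.append((host_id, chunk_hash))  # flag for later review
        return allowed


auditor = RetrieveAuditor()
auditor.link("host-a", "alice", "hash-1")
auditor.link("host-m", "mallory", "hash-2")
print(auditor.may_retrieve("host-a", "hash-1"))  # owner requests own chunk
print(auditor.may_retrieve("host-m", "hash-1"))  # foreign chunk is flagged
```

A check of this kind would not catch the hash manipulation attack, because there the chunk has already been linked to the attacker's account before any download takes place.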


3.4 Attack Detection 


To sum up, when an attacker is able to get access to the 
content of the client database, he is able to download all 
the files of the corresponding Dropbox account directly 
from the Dropbox servers. No further access to the vic- 
tim’s system is needed, and in the simplest case only the 
host ID needs to be sent to the attacker. An alternative 
approach for the attacker is to access only specific files, 
by obtaining only the hash values of the file. The owner 
of the files is unable to detect that the attacker accessed 
the files, for all three attacks. From the cloud storage ser- 
vice operator’s point of view, the stolen host ID attack as
well as the direct download attack are detectable to some
extent. We discuss some countermeasures in Section 6.
However, by using the hash manipulation attack the at-
tacker can avoid detection completely, as this form of
unauthorized access looks to Dropbox as if the attacker
already owns the file. Table 2 gives an overview of all of
the different attacks that can lead to unauthorized file ac-
cess and information leakage².


4 Attack Vectors and Online Slack Space 


This section discusses known attack techniques to exploit 
cloud storage and Dropbox on a large scale. It outlines 
already known attack vectors, and how they could be 
used with the help of Dropbox, or any other cloud stor- 
age service with weak security. Most of them can have 
a severe impact and should be considered in the threat 
model of such services. 


²We communicated with Dropbox and reported our findings prior
to publishing this paper. They implemented a temporary fix to prevent 
these types of attacks and will include a permanent solution in future 
versions.


4.1 Hidden Channel, Data Leakage


The attacks discussed above can be used in numerous 
ways to attack clients, for example by using Dropbox 
as a drop zone for important and possibly sensitive data. 
If the victim is using Dropbox (or any other cloud stor-
age service that is vulnerable to our discovered attack),
the service might be used to exfiltrate data far more
stealthily and quickly than with regular covert
channels [16]. The amount of data that needs
to be sent over the covert channel would be reduced to a
single host ID or the hash values of specific files instead
of the full file. Furthermore, the attacker’s code on the
victim’s machine could copy important files to the
Dropbox folder, wait until they are stored on the cloud
service and delete them again. Afterwards it transmits the
hash values to the attacker, who then downloads these
files directly from Dropbox. This attack requires that the
attacker is able to execute code and has access to the
victim’s file system, e.g., by using malware. One might
argue that these are tough preconditions for this scenario
to work. However, in the case of corporate firewalls, for
example, this kind of
data leakage is much harder to detect as all traffic with 
Dropbox is encrypted with SSL and the transfers would 
blend in perfectly with regular Dropbox activity, since 
Dropbox itself is used for transmitting the data. Cur- 
rently the client has no control measures to decide upon 
which data might get stored in the Dropbox folder. The 
scheme for leaking information and transmitting data to 
an attacker is depicted in Figure 2. 


Figure 2: Covert Channel with Dropbox 


4.2 Online Slack Space 


Uploading a file works very similarly to downloading
with HTTPS (as described above, see section 3.3). The
client software uploads a chunk to Dropbox by calling
https://dl-clientXX.dropbox.com/store with the hash
value and the host ID as HTTPS POST data along with
the actual data. After the upload is finished, the client
software links the uploaded files to the host ID with
another HTTPS request. The updated or newly added
files are now pushed to all computers of the user, and to
all other user accounts if the folder is a shared folder.


Method | Detectability | Consequences
Hash Value Manipulation Attack | Undetectable | Unauthorized file access
Direct Download Attack | Dropbox only | Unauthorized file access
Stolen Host ID Attack | Dropbox only | Get all user files

Table 2: Variants of the Attack


A modified client software can upload files without
limitation if the linking step is omitted. Dropbox can
thus be used to store data without reducing the storage
space available to the account. We define this as online
slack space, as it is similar to regular slack space [21]
from the perspective of a forensic examiner, where
information is hidden in the last block of files on the
filesystem that do not use the entire block. Instead of
hiding information in the last block of a file, data is
hidden in Dropbox chunks that are not linked to the
attacker’s account. If used in combination with a live CD
operating system, no traces are left on the computer that
could be used in the forensic process to infer the
existence of that data once the computer is powered down.
We believe that there is no limitation on how much
information could be hidden, as the exploited mechanisms
are the same as those used by the Dropbox application.
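The difference between storing a chunk and linking it can be illustrated with a toy in-memory model. This is only a sketch under stated assumptions and not Dropbox's implementation: the class `ChunkStore`, its method names and the quota accounting are hypothetical and were chosen solely to show why chunks that are stored but never linked are not charged to any account.

```python
import hashlib


class ChunkStore:
    """Toy model of a deduplicating chunk store (hypothetical, illustration only)."""

    def __init__(self):
        self.chunks = {}  # SHA-256 hex digest -> chunk bytes
        self.links = {}   # host ID -> set of digests linked to that host

    def store(self, data: bytes) -> str:
        """Step 1: upload a chunk. Nothing is charged to any account yet."""
        digest = hashlib.sha256(data).hexdigest()
        self.chunks[digest] = data
        return digest

    def link(self, host_id: str, digest: str) -> None:
        """Step 2: attach an already stored chunk to the account behind host_id."""
        if digest not in self.chunks:
            raise KeyError("unknown chunk")
        self.links.setdefault(host_id, set()).add(digest)

    def used_quota(self, host_id: str) -> int:
        """Bytes charged to an account: only linked chunks are counted."""
        return sum(len(self.chunks[d]) for d in self.links.get(host_id, ()))

    def unlinked(self) -> set:
        """Digests stored on the server but linked to no account."""
        linked = set().union(*self.links.values())
        return set(self.chunks) - linked


store = ChunkStore()
visible = store.store(b"regular file")
store.link("host-A", visible)
hidden = store.store(b"hidden payload")  # linking step deliberately omitted

print(store.used_quota("host-A"))   # counts only the linked chunk
print(hidden in store.unlinked())   # the hidden chunk sits in online slack space
```

In this model the hidden chunk remains retrievable by its hash while no account's quota reflects it, which is the property the text calls online slack space.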


4.3 Attack Vector 


If the host ID is known to an attacker, he can upload
and link arbitrary files to the victim’s Dropbox account.
Instead of linking the file to his own account with the
second HTTPS request, he can link it to an arbitrary host
ID. In combination with an exploit of the operating
system’s file preview functions, e.g., one of the recent
vulnerabilities in Windows³, Linux⁴, or MacOS⁵, this
becomes a powerful exploitation technique. An attacker
could use any 0-day weakness in the file preview of
supported operating systems to execute code on the
victim’s computer by pushing a manipulated file into the
victim’s Dropbox folder and waiting for the user to open
that directory. Social engineering could additionally be
used to trick the victim into executing a file with a
promising filename.


Getting access to the host ID in the first place is tricky,
and in any case requires access to the filesystem. This
however does not reduce the consequences, as it is possible
to store files remotely in other people’s Dropbox. A large
scale infection using Dropbox is however very unlikely,
and if an attacker is able to retrieve the host ID he
already owns the system.


3 Windows Explorer: CVE-2010-2568 or CVE-2010-3970
4 Evince in Nautilus: CVE-2010-2640
5 Finder: CVE-2006-2277


5 Evaluation 


This section studies some of the attacks introduced. We
evaluate whether Dropbox is used to store popular files
from the filesharing network thepiratebay.org⁶ as well as
how long data is stored in the previously defined online
slack space.


5.1 Stored files on Dropbox 


With the hash manipulation attack and the direct download
attack described above it becomes possible to test
whether a given file is already stored on Dropbox. We used
this to evaluate whether Dropbox is used for storing
filesharing files, as filesharing protocols like BitTorrent
rely heavily on hashing for file identification. We
downloaded the top 100 torrents from thepiratebay.org [7]
as of the middle of September 2010. Unfortunately,
BitTorrent uses SHA-1 hashes to identify files and their
chunks, whereas the check against Dropbox requires SHA-256
hash values, so the information in the .torrent file itself
is not sufficient and we had to download parts of the
content. As most of the files on BitTorrent are protected
by copyright, we decided to download only those files from
each .torrent that lack copyright protection. This protects
us from legal complaints, while these files are still
sufficient to show that Dropbox is used to store these
kinds of files. To further protect us against complaints
based on our IP address, our BitTorrent client was modified
to prevent the upload of any data, similar to the approach
described in [27]. We downloaded only the first 4 megabytes
of any file that exceeds this size, as the first chunk is
already sufficient to tell whether a given file is stored
on Dropbox using the hash manipulation attack.

We observed the following types of files that
were identified by the .torrent files:


• Copyright protected content such as movies, songs
or episodes of popular series.


• “Identifying files” that are specific to the copyright
protected material, such as sample files, screen captures
or checksum files, but without copyright.


• Static files that are part of many torrents, such as
release group information files or links to websites.


Those “identifying files” we observed had the following
extensions and information:


• .nfo: Contains information from the release group
that created the .torrent, e.g., list of files, installation
instructions or detailed information and ratings for
movies.


• .srt: Contains subtitles for video files.


• .sfv: Contains CRC32 checksums for every file
within the .torrent.


• .jpg: Contains screenshots of movies or album covers.


• .torrent: The torrent itself contains the hash values
of all the files and chunks, as well as the necessary
tracker information for the clients.


6 Online at http://thepiratebay.org


In total, 98 of those top 100 torrent archives contained
identifying files. We removed the two .torrents that did
not contain such identifying files from our test set.
24 hours later we downloaded the newest entries
from the top 100 list to check how long it takes from the
publication of a torrent until it is stored on Dropbox. 9
new torrents, mostly series, were added to the test set.
Table 3 shows the categories assigned to them by
thepiratebay.org.


Category | Quantity
Application | 3
Game | 5
Movie | 64
Music | 6
Series | 29
Sum | 107





Table 3: Distribution of tested .torrents 


When we downloaded the “identifying files” from
these 107 .torrents, they had in total approximately 460k
seeders and 360k leechers connected (not necessarily
disjoint), with the total number of complete downloads
possibly much higher. For every .torrent file and every
identifying file from the .torrent’s content we generated
the SHA-256 hash value and checked whether the file was
stored on Dropbox, 368 hashes in total. If a file was bigger
than 4 megabytes, we only generated the hash of the first
chunk. Our script did not use the completely stealthy
approach described above, but the less stealthy approach
of creating an HTTPS request with a valid host ID, as
stealthiness was not an issue in our case.
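The hashing step described above can be sketched in a few lines. The function names are hypothetical, and interpreting "4 megabytes" as 4 * 1024 * 1024 bytes is an assumption; the text only states that SHA-256 is used and that the first chunk of a larger file is sufficient.

```python
import hashlib

# Assumption: "4 megabytes" is taken to mean 4 * 1024 * 1024 bytes.
CHUNK_SIZE = 4 * 1024 * 1024


def first_chunk_digest(data: bytes) -> str:
    """SHA-256 hex digest of at most the first 4 MB of the given content."""
    return hashlib.sha256(data[:CHUNK_SIZE]).hexdigest()


def first_chunk_digest_of_file(path: str) -> str:
    """Same as above, but reads only the first chunk from a file on disk."""
    with open(path, "rb") as f:
        return first_chunk_digest(f.read(CHUNK_SIZE))


small = b"release notes"
large = b"A" * (CHUNK_SIZE + 10)

print(first_chunk_digest(small))  # whole file is hashed
print(first_chunk_digest(large))  # only the first chunk is hashed
```

Because only the first chunk is read, a partially downloaded file is enough to produce the value that is checked against the storage service.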
From those 368 hashes, 356 files were retrievable;
only 12 hashes were unknown to Dropbox and the
corresponding files were not stored on Dropbox. Those 12
files were linked to 8 .torrent files. The details:


• In one case the identifying file of the .torrent was
not on Dropbox, but the .torrent file was.


• In three cases the .torrent file was not on Dropbox,
but the identifying files were.


• In four cases the .nfo file was not on Dropbox, but
other identifying files from the same .torrent were.


This means that for every .torrent either the .torrent
file, the content or both are easily retrievable from
Dropbox once the hashes are known. Table 4 shows the
numbers in detail, where hit rate describes how many of
the files were retrievable from Dropbox.

















File | Quantity | Hit rate | Hit rate (rel.)
.torrent | 107 | 106 | 99%
.nfo | 53 | 49 | 92%
others | 208 | 201 | 97%
In total | 368 | 356 | 97%





Table 4: Hit rate for filesharing 
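The relative hit rates in Table 4 follow directly from the quantities and hits. The short sketch below recomputes them; the dictionary layout and the rounding to whole percent are choices made for this illustration.

```python
# (quantity, hits) per file type, copied from Table 4
table4 = {
    ".torrent": (107, 106),
    ".nfo": (53, 49),
    "others": (208, 201),
}


def hit_rate(quantity: int, hits: int) -> int:
    """Relative hit rate in percent, rounded to the nearest integer."""
    return round(100 * hits / quantity)


total_quantity = sum(q for q, _ in table4.values())
total_hits = sum(h for _, h in table4.values())

for name, (quantity, hits) in table4.items():
    print(f"{name}: {hits}/{quantity} = {hit_rate(quantity, hits)}%")
print(f"In total: {total_hits}/{total_quantity} = "
      f"{hit_rate(total_quantity, total_hits)}%")
```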


Furthermore, we analyzed the age of the .torrents to
see how quickly Dropbox users download the .torrents
and the corresponding content and upload everything
to Dropbox. Most of the .torrent files were relatively
young: approximately 20% of the top 100 .torrent files
had been on thepiratebay.org for less than 24 hours
before we were able to retrieve them from Dropbox.
Figure 3 shows the age distribution of all the .torrents.


5.2 Online Slack Space Evaluation 


To assess whether Dropbox could be used to hide files by
uploading them without linking them to any user account, we
generated a set of 30 files with random data and uploaded
them with the HTTPS request method. In addition, we
uploaded 55 files with a regular Dropbox account and
deleted them right afterwards, to assess whether Dropbox
ever deletes old user data. We also evaluated whether there
is some kind of garbage collection that removes files
after a given threshold of time since the upload. The
files were then downloaded every 24 hours and checked
for consistency by calculating multiple hash functions
and comparing the hash values. By using multiple files
with various sizes and random content we minimized the
likelihood of an unintended hash collision and avoided
testing for a file that is stored by another user and thus
always retrievable. Table 5 summarizes the setup.


Figure 3: Age of .torrents

















Method of upload | # | Test duration | Hit rate
Regular folder | 25 | 6 months | 100%
Shared folder | 30 | 6 months | 100%
HTTPS request | 30 | >3 months | 50%
In total | 85 | - | 100%





Table 5: Online slack experiments 
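The daily consistency check used in these experiments can be sketched as follows. The text does not name the hash functions that were calculated, so the choice of SHA-1, SHA-256 and SHA-512 is an assumption, as are the helper names `fingerprint` and `is_consistent`.

```python
import hashlib
import os

# Assumption: the paper does not say which hash functions were used.
ALGORITHMS = ("sha1", "sha256", "sha512")


def fingerprint(data: bytes) -> dict:
    """Compute one digest per algorithm for the given file content."""
    return {name: hashlib.new(name, data).hexdigest() for name in ALGORITHMS}


def is_consistent(reference: dict, downloaded: bytes) -> bool:
    """True if the downloaded content matches the fingerprint taken at upload."""
    return fingerprint(downloaded) == reference


original = os.urandom(1024)        # random content, as in the experiment
reference = fingerprint(original)  # recorded once, before uploading

print(is_consistent(reference, original))            # unchanged file
print(is_consistent(reference, original + b"\x00"))  # altered file
```

Comparing several digests at once makes it very unlikely that a different file is mistaken for the uploaded one.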


Long term undelete: With the free account, users
can undo file modifications or undelete files from the
last 30 days through the web interface. With a so-called
“Pro” account (where the users pay for additional
storage space and other features) undelete is available
for all files without a time limit. We uploaded 55 files
in total on October 7th 2010: 30 files in a folder shared
with another Dropbox account and 25 files in an unshared
folder. Until Dropbox fixed the HTTPS download attack
at the end of April 2011, all of these files remained
continuously available. More than 6 months after
uploading, all files were still retrievable, without
exception.


Online slack: We uploaded 30 files of various sizes
without linking them to any account with the HTTPS
method at the beginning of January 2011. More than 4
weeks later, all files were still retrievable. When
Dropbox fixed the HTTPS download attack in late April 2011,
50% of the files were still available. See Figure 4 for
details.


5.3 Discussion 


It surprised us that for every .torrent, either the
.torrent file, the content or both could be retrieved from
Dropbox, especially considering that some of the
.torrent files were created only a few hours before we
retrieved them. The overall hit rate of 97% indicates that
Dropbox is heavily used for storing files from filesharing
networks. It is also interesting to note that some of the
.torrents contained more content than the storage space
the free Dropbox account currently offers (2 gigabytes
at the time of writing). 11 out of the set of tested 107
.torrents contained more than 2 gigabytes as they were
DVD images, the biggest with 7.2 gigabytes in total size.
This means that whoever stored those files on Dropbox
either has a Dropbox Pro account (for which he or she
pays a monthly fee), or invited a lot of friends to
get additional storage space from the Dropbox referral
program.


Figure 4: Online slack without linking over time


However, we could only infer the existence of these
files. With the approach we used it is not possible to
quantify to what extent Dropbox is used for filesharing
among multiple users. Our results only show that within
the last three to six months, or since the respective
.torrent was created, at least one BitTorrent user
saved his downloads in Dropbox. No conclusions can be
drawn as to whether they are saved in shared folders, or
whether only one person or possibly thousands of people
use Dropbox in that way. In fact, it is equally likely
that a single person uses Dropbox to store these files.


With our experiments regarding online slack space we
showed that it is very easy to hide data on Dropbox with
low accountability. It becomes rather trivial to get some
of the advanced features of Dropbox, like unlimited
undelete and versioning, without cost. Furthermore, a
malicious user can upload files without linking them to his
account, resulting in possibly unlimited storage space
while at the same time possibly causing problems in a
standard forensic examination. In an advanced setup, the
examiner might be confronted with a computer that has
no hard drive, booting from read-only media such as a
Linux live CD and saving all files in online slack space.
No traces or local evidence would be extractable from the
computer [15], which will be an issue in future forensic
examinations. This is similar to using the private mode
in modern browsers, which does not save information
locally [8].


6 Keeping the cloud white 


To ensure trust in cloud storage operators it is vital
not only to make sure that the untrusted cloud storage
operator keeps the files secure with regard to
availability [25], but also to ensure that the client
cannot be attacked through these services. We provide
generic security recommendations for all storage providers
to prevent our attacks, and propose changes to the
communication protocol of Dropbox to include data
possession proofs that can be precalculated on the cloud
storage operator’s side and implemented efficiently as
database lookups.


6.1 Basic security primitives 


Our attacks are not only applicable to Dropbox, but
to all cloud storage services where a server-side data
deduplication scheme is used to prevent retransmission
of files that are already stored at the provider. Current
implementations are based on simple hashing. However,
the client software cannot be trusted to calculate the
hash value correctly and a stronger proof of ownership
is needed. This is a new security aspect of cloud
computing, as up to now the main concern was trust in
the service operator, not in the client.


To ensure that the client is in possession of a file, a 
strong protocol for provable data possession is needed, 
based on either cryptography or probabilistic proofs or 
both. This can be done by using a recent provable data 
possession algorithm such as [11], where the cloud stor- 
age operator selects which challenges the client has to 
answer to get access to the file on the server and thus 
omit the retransmission which is costly for both the client 
and the operator. Recent publications proposed different 
approaches with varying storage and computational over- 
head [12, 20, 10]. Furthermore every service should use 
SSL for all communication and data transfers, something 
which we observed was not the case with every service. 




6.2 Secure Dropbox 


To fix the discovered security issues in Dropbox we
propose several steps to mitigate the risk of abuse.
First of all, a secure data possession protocol should
be used to prevent clients from getting access to files
merely by knowing the hash value of a file. Eventually
every cloud storage operator should employ such a
protocol if the client is not part of a trusted environment.
We therefore propose the implementation of a simple
challenge-response mechanism as outlined in Fig. 5.
In essence: if the client transmits a hash value already
known to the storage operator, the server has to verify
whether the client is in possession of the entire file or
only the hash value. The server could do so by requesting
randomly chosen bytes from the data during the upload
process. Let $H$ be a cryptographic hash function which
maps data $D$ of arbitrary length to a fixed-length hash
value.
$Push_{init}(U, p(U), H(D))$ is a function that initiates the
upload of data $D$ from the client to the server. The user
$U$ and an authentication token $p(U)$ are sent along with
the hash value $H(D)$ of data $D$. $Push(U, p(U), D)$ is
the actual uploading process of data $D$ to the server.
$Req(U, p(U), H(D))$ is a function that requests data $D$
from the server.
$Ver(Ver_{off}, H(D))$ is a function that requests randomly
chosen bytes from data $D$ by specifying their
offsets in the array $Ver_{off}$.
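The verification step $Ver$ can be illustrated with a minimal sketch. It assumes that the server already holds the data belonging to the claimed hash value and simply compares the bytes at randomly chosen offsets; the function names, the default of 16 offsets and the use of Python's `secrets` module are assumptions of this example and not part of the proposed protocol.

```python
import secrets


def make_challenge(data_length: int, count: int = 16) -> list:
    """Server side: pick random byte offsets into the stored data (Ver_off)."""
    return [secrets.randbelow(data_length) for _ in range(count)]


def answer_challenge(data: bytes, offsets: list) -> bytes:
    """Client side: return the bytes found at the requested offsets."""
    return bytes(data[i] for i in offsets)


def verify(stored: bytes, offsets: list, response: bytes) -> bool:
    """Server side: compare the response against the server's own copy."""
    expected = bytes(stored[i] for i in offsets)
    return secrets.compare_digest(expected, response)


stored = bytes(range(256)) * 4               # data D already on the server
offsets = make_challenge(len(stored))

honest = answer_challenge(stored, offsets)   # client that really holds D
forged = bytes(b ^ 0xFF for b in honest)     # client that holds only H(D)

print(verify(stored, offsets, honest))
print(verify(stored, offsets, forged))
```

A client that knows only $H(D)$ has to guess every requested byte, so its chance of passing shrinks quickly as the number of offsets grows.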

Uploading chunks without linking them to a user’s
Dropbox should not be allowed, on the one hand to
prevent clients from having unlimited storage capacity, on
the other hand to make online slack space on Dropbox
infeasible. In many scenarios it is still cheaper to just
add storage capacity than to find a reliable metric
for deciding what data to delete. However, to prevent
misuse of historic data and online slack space, all chunks
that are not linked to a file that is retrievable by a
client should be deleted.
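The clean-up rule stated above amounts to a simple reachability check. The sketch below assumes a hypothetical server-side layout in which chunks are keyed by digest and every retrievable file lists the digests of its chunks; neither the data structures nor the function name come from the paper.

```python
def collect_unlinked(chunks: dict, file_index: dict) -> list:
    """Delete every chunk that no retrievable file refers to.

    chunks:     digest -> chunk bytes (hypothetical server-side store)
    file_index: file name -> list of chunk digests making up that file
    Returns the digests that were removed.
    """
    referenced = {d for digests in file_index.values() for d in digests}
    orphans = [d for d in chunks if d not in referenced]
    for digest in orphans:
        del chunks[digest]
    return orphans


chunks = {"aaa": b"part 1", "bbb": b"part 2", "ccc": b"never linked"}
file_index = {"report.pdf": ["aaa", "bbb"]}

removed = collect_unlinked(chunks, file_index)
print(removed)          # digests of chunks that were in online slack space
print(sorted(chunks))   # chunks that are still needed by a retrievable file
```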





Figure 5: Data verification during upload


Security Measure | Consequences
1. Data possession protocol | Prevent hash manipulation attacks
2. No chunks without linking | Defy online slack space
3. Check for host ID activity | Prevent access if host is not online
4. Dynamic host ID | Smaller window of opportunity
5. Enforcement of data ownership | No unauthorized data access

Table 6: Security Improvements for Dropbox


To further enhance security, several behavioral aspects
can be leveraged, for example checking for host ID
activity: if a client turns on his computer, it connects
to Dropbox to see whether any file has been updated or new
files were added. Afterwards, only that IP address
should be allowed to download files from that host ID’s
Dropbox. If the user changes IP, e.g., by using a VPN
or changing location, Dropbox needs to rebuild the
connection anyway and could use that to link the host
ID to the new IP. In fact, if the host ID is used for
authentication, it should be treated like a cookie [26]:
dynamic in nature and changeable. A dynamic host ID would
reduce the window of opportunity that an attacker could
use to clone a victim’s Dropbox by stealing the host ID.
Most importantly, Dropbox should keep track of which
files are in which Dropboxes (enforcement of data
ownership). If a client downloads a chunk that has never
been in his or her Dropbox, this is easily detectable for
Dropbox.
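Two of the measures above, binding a host ID to the IP address seen at check-in and enforcing data ownership, can be combined into a single server-side check. The following is a sketch under stated assumptions: the class `DownloadPolicy`, its methods and the example addresses are hypothetical, and a real deployment would also have to handle host ID rotation and shared folders.

```python
class DownloadPolicy:
    """Hypothetical server-side checks; not Dropbox's actual implementation."""

    def __init__(self):
        self.owned = {}     # host ID -> digests of chunks in that Dropbox
        self.bound_ip = {}  # host ID -> IP address seen at the last check-in

    def check_in(self, host_id: str, ip: str) -> None:
        """Called when a client connects; binds the host ID to its current IP."""
        self.bound_ip[host_id] = ip

    def link(self, host_id: str, digest: str) -> None:
        """Record that a chunk belongs to the Dropbox behind host_id."""
        self.owned.setdefault(host_id, set()).add(digest)

    def may_download(self, host_id: str, ip: str, digest: str) -> bool:
        """Allow a download only from the bound IP and only for owned chunks."""
        from_bound_ip = self.bound_ip.get(host_id) == ip
        owns_chunk = digest in self.owned.get(host_id, set())
        return from_bound_ip and owns_chunk


policy = DownloadPolicy()
policy.check_in("host-A", "192.0.2.10")
policy.link("host-A", "digest-1")

print(policy.may_download("host-A", "192.0.2.10", "digest-1"))    # owner, bound IP
print(policy.may_download("host-A", "198.51.100.7", "digest-1"))  # stolen host ID
print(policy.may_download("host-A", "192.0.2.10", "digest-2"))    # guessed hash
```

Both checks are dictionary lookups in this model, in line with the expectation stated in the text that the mitigations can be implemented as simple database lookups.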


Unfortunately we are unable to assess the performance 
impact and communication overhead of our mitigation 
strategies, but we believe that most of them can be im- 
plemented as simple database lookups. Different data 
possession algorithms have already been studied for their 
overhead, for example S-PDP and E-PDP from [11] are 
bounded by O(1). Table 6 summarizes all needed miti- 
gation steps to prevent our attacks. 


7 Conclusion 


In this paper we presented specific attacks on cloud stor- 
age operators where the attacker can download arbitrary 
files under certain conditions. We proved the feasibil- 
ity on the online storage provider Dropbox and showed 
that Dropbox is used heavily to store data from thepi- 
ratebay.org, a popular BitTorrent website. Furthermore 
we defined and evaluated online slack space and demon- 
strated that it can be used to hide files. We believe that 
these vulnerabilities are not specific to Dropbox, as the 
underlying communication protocol is straightforward 
and very likely to be adopted by other cloud storage op- 
erators to save bandwidth and storage overhead. The dis- 
cussed countermeasures, especially the data possession 
proof on the client side, should be included by all cloud 
storage operators. 
Abstract

Modern automobiles are pervasively computerized, and 
hence potentially vulnerable to attack. However, while 
previous research has shown that the internal networks 
within some modern cars are insecure, the associated 
threat model— requiring prior physical access —has 
justifiably been viewed as unrealistic. Thus, it remains an 
open question if automobiles can also be susceptible to 
remote compromise. Our work seeks to put this question 
to rest by systematically analyzing the external attack 
surface of a modern automobile. We discover that remote 
exploitation is feasible via a broad range of attack vectors 
(including mechanics tools, CD players, Bluetooth and 
cellular radio), and further, that wireless communications 
channels allow long distance vehicle control, location 
tracking, in-cabin audio exfiltration and theft. Finally, we 
discuss the structural characteristics of the automotive 
ecosystem that give rise to such problems and highlight 
the practical challenges in mitigating them. 


1 Introduction 
Modern cars are controlled by complex distributed com- 
puter systems comprising millions of lines of code execut- 
ing on tens of heterogeneous processors with rich connec- 
tivity provided by internal networks (e.g., CAN). While 
this structure has offered significant benefits to efficiency, 
safety and cost, it has also created the opportunity for new 
attacks. For example, in previous work we demonstrated 
that an attacker connected to a car’s internal network can 
circumvent all computer control systems, including safety 
critical elements such as the brakes and engine [14]. 
However, the threat model underlying past work 
(including our own) has been met with significant, and 
justifiable, criticism (e.g., [1, 3, 16]). In particular, it is 
widely felt that presupposing an attacker’s ability to physi- 
cally connect to a car’s internal computer network may be 
unrealistic. Moreover, it is often pointed out that attackers 
with physical access can easily mount non-computerized 
attacks as well (e.g., cutting the brake lines). 
This situation suggests a significant gap in knowledge,
and one with considerable practical import. To what ex- 
tent are external attacks possible, to what extent are they 
practical, and what vectors represent the greatest risks? 
Is the etiology of such vulnerabilities the same as for 
desktop software and can we think of defense in the same 
manner? Our research seeks to fill this knowledge gap
through a systematic and empirical analysis of the remote
attack surface of a late model mass-production sedan.

We make four principal contributions: 

Threat model characterization. We systematically 
synthesize a set of possible external attack vectors as 
a function of the attacker’s ability to deliver malicious 
input via particular modalities: indirect physical access, 
short-range wireless access, and long-range wireless 
access. Within each of these categories, we characterize 
the attack surface exposed in current automobiles and 
their surprisingly large set of I/O channels. 

Vulnerability analysis. For each access vector category, 
we investigate one or more concrete examples in depth 
and assess the level of actual exposure. In each case we 
find the existence of practically exploitable vulnerabilities 
that permit arbitrary automotive control without requiring 
direct physical access. Among these, we demonstrate the 
ability to compromise a car via vulnerable diagnostics 
equipment widely used by mechanics, through the media 
player via inadvertent playing of a specially modified 
song in WMA format, via vulnerabilities in hands-free 
Bluetooth functionality and, finally, by calling the car’s 
cellular modem and playing a carefully crafted audio 
signal encoding both an exploit and a bootstrap loader 
for additional remote-control functionality. 

Threat assessment. From these uncovered vulnerabili- 
ties, we consider the question of “utility” to an attacker: 
what capabilities does the vulnerability enable? Unique 
to this work, we study how an attacker might leverage a 
car’s external interfaces for post-compromise control. We 
demonstrate multiple post-compromise control channels 
(including TPMS wireless signals and FM radio), interactive
remote control via the Internet and real-time data
exfiltration of position, speed and surreptitious streaming 
of cabin audio (i.e., anything being said in the vehicle) to 
an outside recipient. Finally, we also explore potential at- 
tack scenarios and gauge whether these threats are purely 
conceptual or whether there are plausible motives that 
transform them into actual risks. In particular, we demon- 
strate complete capabilities for both theft and surveillance. 
Synthesis. On reflection, we noted that the vulnera- 
bilities we uncovered have surprising similarities. We 
believe that these are not mere coincidences, but that 
many of these security problems arise, in part, from 
systemic structural issues in the automotive ecosystem. 
Given these lessons, we make a set of concrete, pragmatic 
recommendations which significantly raise the bar for 
automotive system security. These recommendations are 
intended to “bridge the gap” until deeper architectural 
redesign can be carried out. 


2 Background and Related Work 
Modern automobiles are controlled by a heterogeneous 
combination of digital components. These components, 
Electronic Control Units (ECUs), oversee a broad range 
of functionality, including the drivetrain, brakes, lighting, 
and entertainment. Indeed, very few operations are not 
mediated by computer control in a modern vehicle (with 
the parking brake and steering being the last holdouts, 
though semi-automatic parallel parking capabilities are 
available in some vehicles and full steer-by-wire has been 
demonstrated in several concept cars). Charette estimates 
that a modern luxury vehicle includes up to 70 distinct 
ECUs including tens of millions of lines of code [5]. In 
turn, ECUs are interconnected by common wired net- 
works, usually a variant of the Controller Area Network 
(CAN) [12] or FlexRay bus [8]. This interconnection 
permits complex safety and convenience features such as 
pre-tensioning of seat-belts when a crash is predicted and 
automatically varying radio volume as a function of speed. 
At the same time, this architecture provides a broad 
internal attack surface since on a given bus each compo- 
nent has at least implicit access to every other component. 
Indeed, several research groups have described how 
this architecture might be exploited in the presence 
of compromised components [15, 24, 26, 27, 28] or 
demonstrated such exploits by spoofing messages to 
isolated components in the lab [10]. Most recently, 
our own group documented experiments on a complete 
automobile, demonstrating that if an adversary were 
able to communicate on one or more of a car’s internal 
network buses, then this capability could be sufficient 
to maliciously control critical components across the 
entire car (including dangerous behavior such as forcibly 
engaging or disengaging individual brakes independent of 
driver input) [14]. However, these results raise the question
of how an adversary might be able to access a car’s
internal bus (and thus compromise its ECUs) absent direct 
physical access, a question that we answer in this paper. 

About the latter question — understanding the external 
attack surface of modern vehicles—there has been 
far less research work. Among the exceptions is Rouf 
et al.’s recent analysis of the wireless Tire Pressure 
Monitoring System (TPMS) in a modern vehicle [22]. 
While their work was primarily focused on the privacy 
implications of TPMS broadcasts, they also described 
methods for manipulating drivers by spoofing erroneous 
tire pressure readings and, most relevant to our work, 
an experience in which they accidentally caused the 
ECU managing TPMS data to stop functioning through 
wireless signals alone. Still others have focused on the 
computer security issues around car theft, including 
Francillon et al.’s recent demonstration of relay attacks 
against keyless entry systems [9], and the many attacks 
on the RFID-based protocols used by engine immobi- 
lizers to identify the presence of a valid ignition key, 
e.g., [2, 6, 11]. Orthogonally, there has been work that 
considers the future security issues (and expanded attack 
surface) associated with proposed vehicle-to-vehicle 
(V2V) systems (sometimes also called vehicular ad-hoc 
networks, or VANETs) [4, 13, 21]. To the best of our 
knowledge, however, we are the first to consider the full 
external attack surface of the contemporary automobile, 
characterize the threat models under which this surface is 
exposed, and experimentally demonstrate the practicality 
of remote threats, remote control, and remote data 
exfiltration. Our experience further gives us the vantage 
point to reflect on some of the ecosystem challenges that 
give rise to these problems and point the way forward 
to better secure the automotive platform in the future. 


3 Automotive threat models 
While past work has illuminated specific classes of threats 
to automotive systems — such as the technical security 
properties of their internal networks [14, 15, 24, 26, 27, 
28] — we believe that it is critical for future work to place 
specific threats and defenses in the context of the entire 
automotive platform. In this section, we aim to bootstrap 
such a comprehensive treatment by characterizing the 
threat model for a modern automobile. Though we 
present it first, our threat model is informed significantly 
by the experimental investigations we carried out, which 
are described in subsequent sections. 

In defining our threat model, we distinguish between 
technical capabilities and operational capabilities. 

Technical capabilities describe our assumptions con- 
cerning what the adversary knows about its target vehicles 
as well as her ability to analyze these systems to develop 
malicious inputs for various I/O channels. For example, 
we assume that the adversary has access to an instance of 
the automobile model being targeted and has the technical
skill to reverse engineer the appropriate subsystems and
protocols (or is able to purchase such information from
a third party). Moreover, we assume she is able to obtain
the appropriate hardware or medium to transmit messages
whose encoding is appropriate for any given channel.1
When encountering cryptographic controls, we also 
assume that the adversary is computationally bounded 
and cannot efficiently brute force large shared secrets, 
such as large symmetric encryption keys. In general, we 
assume that the attacker only has access to information 
that can be directly gleaned from examining the systems 
of a vehicle similar to the one being targeted.2 We believe
these assumptions are quite minimal and mimic the 
access afforded to us when conducting this work. 
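
To give a rough sense of what "computationally bounded" means for an exhaustive key search, the sketch below estimates the expected search time for a few key sizes. The guess rate (10^12 keys per second) and the key sizes are illustrative assumptions, not figures from our experiments.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def expected_years_to_brute_force(key_bits: int, guesses_per_second: float) -> float:
    """Expected time, in years, to find a key by exhaustive search.

    On average an attacker must try half of the 2**key_bits candidates.
    """
    expected_guesses = 2 ** (key_bits - 1)
    return expected_guesses / guesses_per_second / SECONDS_PER_YEAR


if __name__ == "__main__":
    rate = 1e12  # assumed guess rate; purely illustrative
    for bits in (16, 40, 128):
        years = expected_years_to_brute_force(bits, rate)
        print(f"{bits}-bit key: {years:.3g} years")
```

Under these assumptions short secrets fall almost immediately, while a 128-bit key remains far out of reach, which is the distinction the threat model relies on.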

By contrast, operational capabilities characterize the 
adversary’s requirements in delivering a malicious input 
to a particular access vector in the field. In considering 
the full range of I/O capabilities present in a modern 
vehicle, we identify the qualitative differences in the 
challenges required to access each channel. These in 
turn can be roughly classified into three categories: 
indirect physical access, short-range wireless access, 
and long-range wireless access. In the remainder of this 
section we explore the threat model for each of these 
categories and within each we synthesize the “attack 
surface” presented by the full range of I/O channels 
present in today’s automobiles. Figure 1 highlights where 
I/O channels exist on a modern automobile today. 
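
As a compact summary of this classification, the following sketch records the three operational categories and the channels discussed under each in the remainder of this section. The channel labels are informal names taken from the text; the data structure itself is only an illustration and not part of our tooling.

```python
from enum import Enum
from typing import List


class AccessCategory(Enum):
    INDIRECT_PHYSICAL = "indirect physical access"
    SHORT_RANGE_WIRELESS = "short-range wireless access"
    LONG_RANGE_WIRELESS = "long-range wireless access"


# Informal channel labels, grouped as in Sections 3.1 to 3.3.
CHANNELS = {
    "OBD-II via service tools": AccessCategory.INDIRECT_PHYSICAL,
    "EV charging cable": AccessCategory.INDIRECT_PHYSICAL,
    "CD player": AccessCategory.INDIRECT_PHYSICAL,
    "USB / iPod port": AccessCategory.INDIRECT_PHYSICAL,
    "Bluetooth": AccessCategory.SHORT_RANGE_WIRELESS,
    "Remote keyless entry": AccessCategory.SHORT_RANGE_WIRELESS,
    "TPMS": AccessCategory.SHORT_RANGE_WIRELESS,
    "RFID immobilizer": AccessCategory.SHORT_RANGE_WIRELESS,
    "WiFi": AccessCategory.SHORT_RANGE_WIRELESS,
    "DSRC": AccessCategory.SHORT_RANGE_WIRELESS,
    "GPS": AccessCategory.LONG_RANGE_WIRELESS,
    "Satellite radio": AccessCategory.LONG_RANGE_WIRELESS,
    "Digital radio": AccessCategory.LONG_RANGE_WIRELESS,
    "RDS / TMC": AccessCategory.LONG_RANGE_WIRELESS,
    "Cellular telematics": AccessCategory.LONG_RANGE_WIRELESS,
}


def attack_surface(category: AccessCategory) -> List[str]:
    """Return the channel labels that fall under one operational category."""
    return sorted(name for name, cat in CHANNELS.items() if cat is category)


if __name__ == "__main__":
    for category in AccessCategory:
        print(category.value, "->", ", ".join(attack_surface(category)))
```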


3.1 Indirect physical access 

Modern automobiles provide several physical interfaces 
that either directly or indirectly access the car’s internal 
networks. We consider the full physical attack surface 
here, under the constraint that the adversary may not 
directly access these physical interfaces herself but must 
instead work through some intermediary. 

OBD-II. The most significant automotive interface is 
the OBD-II port, federally mandated in the U.S., which 
typically provides direct access to the automobile’s 
key CAN buses and can provide sufficient access to 
compromise the full range of automotive systems [14]. 
While our threat model forbids the adversary from direct 
access herself, we note that the OBD-II port is commonly 


1For the concrete vulnerabilities we will explore, the hardware
cost for such capabilities is modest, requiring only commodity laptop 
computers, an audio card, a USB-to-CAN interface, and, in a few 
instances, an inexpensive, off-the-shelf USRP software radio platform. 

2A question which we do not consider in this work is the extent to 
which the attack surface is “portable” between vehicle models from 
a given manufacturer. There is significant evidence that some such 
attacks are portable as manufacturers prefer to build a small number 
of underlying platforms that are specialized to deliver model-specific 
features, but we are not in a position to evaluate this question compre- 
hensively. 
Figure 1: Digital I/O channels appearing on a modern
car. Colors indicate rough grouping of ECUs by function. 


accessed by service personnel during routine maintenance 
for both diagnostics and ECU programming. 

Historically, this access has been achieved using dedicated
handheld “scan” tools such as Ford’s NGS, Nissan’s
Consult II, and Toyota’s Diagnostic Tester, which are
themselves programmed via Windows-based personal
computers. For modern vehicles, most manufacturers
have adopted an approach that is PC-centric. Under this 
model, a laptop computer interfaces with a “PassThru” 
device (typically directly via USB or WiFi) that in turn 
is plugged into the car’s OBD-II port. Software on the 
laptop computer can then interrogate or program the car’s 
ECUs via this device (typically using the standard SAE 
J2534 API). Examples of such tools include Toyota’s 
TIS, Ford’s VCM, Nissan’s Consult 3 and Honda’s HDS 
among others. 

In both situations Windows-based computers directly 
or indirectly control the data to be sent to the automobile. 
Thus, if an adversary were able to compromise such 
systems at the dealership she could amplify this access to 
attack any cars under service. Such laptop computers are 
typically Internet-connected (indeed, this is a requirement 
for some manufacturers’ systems), so traditional means 
of personal computer compromise could be employed. 

Further afield, electric vehicles may also communicate 
with external chargers via the charging cable. An 
adversary able to compromise the external charging 
infrastructure may thus be able to leverage that access 
to subsequently attack any connected automobile. 

Entertainment: Disc, USB and iPod. The other
important class of physical interfaces is focused on
entertainment systems. Virtually all automobiles shipped 
today provide a CD player able to interpret a wide 
variety of audio formats (raw “Red Book” audio, MP3, 
WMA, and so on). Similarly, vehicle manufacturers also 
provide some kind of external digital multimedia port 
(typically either a USB port or an iPod/iPhone docking 
port) for allowing users to control their car’s media 
system using their personal audio player or phone. Some
manufacturers have widened this interface further; BMW 
and Mini recently announced their support for “iPod Out,” 
a scheme whereby Apple media devices will be able to 
control the display on the car’s console. 

Consequently, an adversary might deliver malicious 
input by encoding it onto a CD or as a song file and 
using social engineering to convince the user to play it. 
Alternatively, she might compromise the user’s phone or 
iPod out of band and install software onto it that attacks 
the car’s media system when connected. 

Taking over a CD player alone is a limited threat; but, 
for a variety of reasons, automotive media systems are 
not standalone devices. Indeed, many such systems are 
now CAN bus interconnected, either to directly interface 
with other automotive systems (e.g., to support chimes, 
certain hands-free features, or to display messages on 
the console) or simply to support a common maintenance 
path for updating all ECU firmware. Thus, counterintu- 
itively, a compromised CD player can offer an effective 
vector for attacking other automotive components. 


3.2 Short-range wireless access 

Indirect physical access has a range of drawbacks in- 
cluding its operational complexity, challenges in precise 
targeting, and the inability to control the time of compro- 
mise. Here we weaken the operational requirements on 
the attacker and consider the attack surface for automotive 
wireless interfaces that operate over short ranges. These 
include Bluetooth, Remote Keyless Entry, RFIDs, Tire 
Pressure Monitoring Systems, WiFi, and Dedicated Short- 
Range Communications. For this portion of the attack 
surface we assume that the adversary is able to place 
a wireless transmitter in proximity to the car’s receiver 
(between 5 and 300 meters depending on the channel). 

Bluetooth. Bluetooth has become the de facto standard
for supporting hands-free calling in automobiles and 
is standard in mainstream vehicles sold by all major 
automobile manufacturers. While the lowest level of the 
Bluetooth protocol is typically implemented in hardware, 
the management and services component of the Bluetooth 
stack is often implemented in software. In normal usage, 
the Class 2 devices used in automotive implementations 
have a range of 10 meters, but others have demonstrated 
that this range can be extended through amplifiers and 
directional antennas [20]. 

Remote Keyless Entry. Today, all but entry-level 
automobiles shipped in the U.S. use RF-based remote 
keyless entry (RKE) systems to remotely open doors, 
activate alarms, flash lights and, in some cases, start the 
ignition (all typically using digital signals encoded over 
315 MHz in the U.S. and 433 MHz in Europe). 

Tire pressure. In the U.S., all 2007 model year and 
newer cars are required to support a Tire Pressure Monitoring
System (TPMS) to alert drivers about under- or over-inflated
tires. The most common form of such systems, so-called
“Direct TPMS,” uses rotating sensors that transmit
digital telemetry (frequently in similar bands as RKEs). 

RFID car keys. RFID-based vehicle immobilizers
are now nearly ubiquitous in modern automobiles and 
are mandatory in many countries throughout the world. 
These systems embed an RFID tag in a key or key fob 
and a reader in or near the car’s steering column. These 
systems can prevent the car from operating unless the 
correct key (as verified by the presence of the correct 
RFID tag) is present. 

Emerging short-range channels. A number of manu- 
facturers have started to discuss providing 802.11 WiFi 
access in their automobiles, typically to provide “hotspot” 
Internet access via bridging to a cellular 3G data link. 
In particular, Ford offers this capability in the 2012 
Ford Focus. (Several 2011 models also provided WiFi 
receivers, but we understand they were used primarily 
for assembly line programming.) 

Finally, while not currently deployed, an emerging
wireless channel is defined in the Dedicated Short-Range
Communications (DSRC) standard, which is being 
incorporated into proposed standards for Cooperative 
Collision Warning/Avoidance and Cooperative Cruise 
Control. Representative programs in the U.S. include the 
Department of Transportation’s Cooperative Intersection 
Collision Avoidance Systems (CICAS-V) and the Vehicle 
Safety Communications Consortium’s VSC-A project. 
In such systems, forward vehicles communicate digitally 
to trailing cars to inform them of sudden changes in 
acceleration to support improved collision avoidance and 
harm reduction. 

Summary. For all of these channels, if a vulnerability exists
in the ECU software responsible for parsing channel
messages, then an adversary may compromise the ECU 
(and by extension the entire vehicle) simply by transmit- 
ting a malicious input within the automobile’s vicinity. 


3.3 Long-range wireless

Finally, automobiles increasingly include long distance 
(greater than 1 km) digital access channels as well. These
tend to fall into two categories: broadcast channels and 
addressable channels. 

Broadcast channels. Broadcast channels are chan- 
nels that are not specifically directed towards a given 
automobile but can be “tuned into” by receivers on- 
demand. In addition to being part of the external at- 
tack surface, long-range broadcast mediums can be 
appealing as control channels (i.e., for triggering at- 
tacks) because they are difficult to attribute, can com- 
mand multiple receivers at once, and do not require 
attackers to obtain precise addressing for their vic- 
tims. 
The modern automobile includes a plethora of broadcast
receivers for long-range signals: Global Positioning
System (GPS),3 Satellite Radio (e.g., SiriusXM receivers
common to late-model vehicles from Honda/Acura, GM,
Toyota, Saab, Ford, Kia, BMW and Audi), Digital Radio 
(including the U.S. HD Radio system, standard on 2011 
Ford and Volvo models, and Europe’s DAB offered in 
Ford, Audi, Mercedes, Volvo and Toyota among others), 
and the Radio Data System (RDS) and Traffic Message 
Channel (TMC) signals transmitted as digital subcarriers 
on existing FM-bands. 

The range of such signals depends on transmitter
power, modulation, terrain, and interference. As an
example, a 5 W RDS transmitter can be expected to 
deliver its 1.2 kbps signal reliably over distances up 
to 10 km. In general, these channels are implemented in 
an automobile’s media system (radio, CD player, satellite 
receiver) which, as mentioned previously, frequently 
provides access via internal automotive networks to other 
key automotive ECUs. 
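
To put the 1.2 kbps figure in perspective, the short calculation below estimates how long a payload of a given size would take to deliver over such a channel. It uses the raw bit rate quoted above and deliberately ignores framing and error-correction overhead, so real delivery times would be longer; the payload sizes are arbitrary examples.

```python
def transfer_seconds(payload_bytes: int, bits_per_second: float = 1200.0) -> float:
    """Idealized time to send payload_bytes at the given raw bit rate.

    Protocol framing and error correction are ignored (a simplifying
    assumption), so this is a lower bound on the real transfer time.
    """
    return payload_bytes * 8 / bits_per_second


if __name__ == "__main__":
    for size in (64, 1024, 64 * 1024):  # arbitrary example payload sizes
        print(f"{size:6d} bytes: {transfer_seconds(size):8.1f} s")
```

Even under this optimistic model, a channel of this kind is better suited to short trigger or control messages than to bulk transfer.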

Addressable channels. Perhaps the most important part
of the long-range wireless attack surface is that exposed 
by the remote telematics systems (e.g., Ford’s Sync, 
GM’s OnStar, Toyota’s SafetyConnect, Lexus’ Enform, 
BMW’s BMW Assist, and Mercedes-Benz’ mbrace) that 
provide continuous connectivity via cellular voice and 
data networks. These systems provide a broad range of 
features supporting safety (crash reporting), diagnostics 
(early alert of mechanical issues), anti-theft (remote track 
and disable), and convenience (hands-free data access 
such as driving directions or weather). 

These cellular channels offer many advantages for
attackers. They can be accessed over arbitrary distance
(due to the wide coverage of cellular data infrastructure) 
in a largely anonymous fashion, typically have relatively 
high bandwidth, are two-way channels (supporting inter- 
active control and data exfiltration), and are individually 
addressable. 

Stepping back. There is a significant knowledge gap
between these possible threats and what is known to 
date about automotive security. Given this knowledge 
gap, much of this threat model may seem far-fetched. 
However, in the next section of this paper we find quite 
the opposite. For each category of access vector we 
will explore one or two aspects of the attack surface 
deeply, identify concrete vulnerabilities, and explore and 
demonstrate practical attacks that are able to completely 
compromise our target automobile’s systems without 
requiring direct physical access. 


3We do not currently consider GPS to be a practical access vector 
for an attacker because in all automotive implementations we are aware 
of, GPS signals are processed predominantly in custom hardware. By 
contrast, we have identified significant software-based input processing 
in other long-range wireless receivers. 
4 Vulnerability Analysis

We now turn to our experimental exploration of the 
attack surface. We first describe the automobile and 
key components under evaluation and provide some 
context for the tools and methods we employed. We then 
explore in-depth examples of vulnerabilities via indirect 
physical channels (CDs and service visits), short-range 
wireless channels (Bluetooth), and long-range wireless 
(cellular). Table 1 summarizes these results as well as our 
qualitative assessment of the cost (in effort) to discover 
and exploit these vulnerabilities. 


4.1 Experimental context 
All of our experimental work focuses on a moderately 
priced late model sedan with the standard options and 
components. Between 100,000 and 200,000 of this 
model were produced in the year of manufacture. The 
car includes fewer than 30 ECUs comprising both critical
drivetrain components and less critical components
such as windshield wipers, door locks and entertainment 
functions. These ECUs are interconnected via multiple 
CAN buses, bridged where necessary. The car exposes 
a number of external vectors including the OBD-II port, 
media player, Bluetooth, wireless TPMS sensors, keyless 
entry, satellite radio, RDS, and a telematics unit. The 
last provides voice and data access via cellular networks, 
connects to all CAN buses, and has access to Bluetooth, 
GPS and independent hands-free audio functionality (via 
an embedded microphone in the passenger cabin). We 
also obtained the manufacturer’s standard “PassThru” 
device used by dealerships and service stations for 
ECU diagnosis and reprogramming, as well as the 
associated programming software. For several ECUs, 
notably the media and telematics units, we purchased a 
number of identical replacement units via on-line markets 
to accommodate the inevitable “bricking” caused by 
imperfect attempts at code injection. 

Building on our previous work, we first established
a set of messages and signals that could be sent on our
car’s CAN bus (via OBD-II) to control key components
(e.g., lights, locks, brakes, and engine), as well as a method
for injecting code into key ECUs to insert persistent
capabilities and to bridge across multiple CAN buses [14].
Note that such inter-bus bridging is critical to many of the
attacks we explore, since it exposes the attack surface of one
set of components to components on a separate bus; we
explain briefly here. Most vehicles implement multiple buses,
each of which hosts a subset of the ECUs.4 However, for
functionality reasons these buses must be interconnected to
support the complex coupling between pairs of ECUs, and
thus a small number of ECUs are physically connected to
multiple buses and act as logical bridges. Consequently,
by modifying the “bridge” ECUs (either via a vulnerability
or simply by reflashing them over the CAN bus as they
are designed to be) an attacker can amplify an attack on
one bus to gain access to components on another. As a
result, compromising any ECU with access to some CAN bus
on our vehicle (e.g., the media player) is sufficient to
compromise the entire vehicle.


4In prior work we hypothesized that CAN buses were purposely
separated for security reasons: one for safety-critical components like
the brakes and engine and the other for less important components such
as a radio. Based on discussions with industry experts we have learned
that this separation has until now often been driven by bandwidth and
integration concerns and not necessarily security.


Vulnerability      Channel      Implemented capability                    Visible   Scale   Full     Cost         Section
class                                                                     to user           control
Direct physical    OBD-II port  Plug attack hardware directly into car    Yes       Small   Yes      Low          Prior work [14]
                                OBD-II port
Indirect physical  CD           CD-based firmware update                  Yes       Small   Yes      Medium       Section 4.2
                   CD           Special song (WMA)                        Yes*      Medium  Yes      Medium-High  Section 4.2
                   PassThru     WiFi or wired control connection to       No        Small   Yes      Low          Section 4.2
                                advertised PassThru devices
                   PassThru     WiFi or wired shell injection             No        Viral   Yes      Low          Section 4.2
Short-range        Bluetooth    Buffer overflow with paired Android       No        Large   Yes      Low-Medium   Section 4.3
wireless                        phone and Trojan app
                   Bluetooth    Sniff MAC address, brute force PIN,       No        Small   Yes      Low-Medium   Section 4.3
                                buffer overflow
Long-range         Cellular     Call car, authentication exploit, buffer  No        Large   Yes      Medium-High  Section 4.4
wireless                        overflow (using laptop)
                   Cellular     Call car, authentication exploit, buffer  No        Large   Yes      Medium-High  Section 4.4
                                overflow (using iPod with exploit audio
                                file, earphones, and a telephone)

Table 1: Attack surface capabilities. The Visible to User column indicates whether the compromise process is visible to the
user (the driver or the technician); we discuss social engineering attacks for navigating user detection in the body. For (*),
users will perceive a malfunctioning CD. The Scale column captures the approximate scale of the attack, e.g., the CD firmware
update attack is small-scale because it requires distributing a CD to each target car. The Full Control column indicates whether
this exploit yields full control over the component’s connected CAN bus (and, by transitivity, all the ECUs in the car). Finally,
the Cost column captures the approximate effort to develop these attack capabilities.

Combining these ECU control and bridging com- 
ponents, we constructed a general “payload” that we 
attempted to deliver in our subsequent experiments 
with the external attack surface.5 To be clear, for every
vulnerability we demonstrate, we are able to obtain 
complete control over the vehicle’s systems. We did 
not explore weaker attacks. 
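
The bridging argument made earlier in this section can be illustrated with a small reachability sketch. The bus and ECU names below are hypothetical and do not describe our car's actual topology; the model simply assumes that any ECU on a bus can attack its bus neighbors and that ECUs attached to several buses can be reprogrammed to relay traffic.

```python
from collections import deque
from typing import Dict, Set

# Hypothetical topology: names are illustrative only.
BUS_MEMBERS: Dict[str, Set[str]] = {
    "high_speed": {"engine", "brakes", "telematics", "body_controller"},
    "low_speed": {"media_player", "door_locks", "telematics", "body_controller"},
}


def reachable_ecus(start_ecu: str, bus_members: Dict[str, Set[str]]) -> Set[str]:
    """Breadth-first search over ECUs that share at least one bus.

    An ECU attached to two buses acts as a bridge, so the search crosses
    from one bus to another through it.
    """
    seen = {start_ecu}
    queue = deque([start_ecu])
    while queue:
        ecu = queue.popleft()
        for members in bus_members.values():
            if ecu in members:
                for other in members - seen:
                    seen.add(other)
                    queue.append(other)
    return seen


if __name__ == "__main__":
    print(sorted(reachable_ecus("media_player", BUS_MEMBERS)))
```

In this toy model, starting from the media player reaches every ECU because two components sit on both buses; removing all shared members would confine the search to a single bus.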

For each ECU we consider, our experimental approach
was to extract its firmware and then explicitly reverse
engineer its I/O code and data flow using disassembly,
interactive logging and debugging tools where appropriate.
In most cases, extracting the firmware was possible
directly via the CAN bus (this was especially convenient 
because in most ECUs we encountered, the flash chips 
are not socketed and while we were able to desolder and 
read such chips directly, the process was quite painful). 

Having the firmware in hand, we performed three basic 
types of analysis: raw code analysis, in situ observations, 


5In this work we experimented with two equivalent vehicles to ensure 
that our results were not tied to artifacts of a particular vehicle instance. 
and interactive debugging with controlled inputs on the
bench. In the first case, we identified the microprocessor 
(e.g., different components described in this paper use 
System on Chip (SoC) variants of the PowerPC, ARM, 
Super-H and other architectures) and used the industry- 
standard IDA Pro disassembler to map control flow and 
identify potential vulnerabilities, as well as debugging 
and logging options that could be enabled to aid in 
reverse engineering.6 In situ observation with logging
enabled allowed us to understand normal operation of the 
ECU and let us concentrate on potential vulnerabilities 
near commonly used code paths. Finally, ECUs were 
removed from the car and placed into a test harness on the 
bench from which we could carefully control all inputs 
and monitor outputs. In this environment, interactive 
debuggers were used to examine memory and single step 
through vulnerable code under repeatable conditions. For 
one such device, the Super-H-based media player, we 
resorted to writing our own native debugger and exported 
a control and output interface through an unused serial 
UART interface we “broke out” off the circuit board. 

In general, we made use of any native debugging I/O 
we could identify. For example, like the media player, 
the telematics unit exposed an unused UART that we 
tapped to monitor internal debugging messages as we 
interactively probed its I/O channels. In other cases, we 


6IDA Pro does not support embedded architectures as well as x86
and consequently we needed to modify IDA Pro to correctly parse 
the full instruction set and object format of the target system. In one 
particular case (for the TPMS processor) IDA Pro did not provide any 
native support and we were forced to write a complete architecture 
module in order to use the tool. 
selectively rewrote ECU memory (via the CAN bus or by
exploiting software vulnerabilities) or rewrote portions 
of the flash chips using the manufacturer-standard ECU 
programming tools. For the telematics unit, we wrote 
a new character driver that exported a command shell 
to its Unix-like operating system directly over the OBD- 
II port to enable interactive debugging in a live vehicle. 
In the end, our experience was that although the ECU 
environment was somewhat more challenging than that 
of desktop operating systems, it was surmountable with 
dedicated effort. 


4.2 Indirect physical channels 

We consider two distinct indirect physical vectors in 
detail: the media player (via the CD player) and service 
access to the OBD-II port. We describe each in turn 
along with examples of when an adversary might be able 
to deliver malicious input. 

Media player. The media player in our car is fairly 
typical, receiving a variety of wireless broadcast signals, 
including analog AM and FM as well as digital signals 
via FM sub-carriers (RDS, called RBDS in the U.S.) and 
satellite radio. The media player also accepts standard 
compact discs (via physical insertion) and decodes audio 
encoded in a number of formats including raw Red Book 
audio as well as MP3 and WMA files encoded on an 
ISO 9660 filesystem. 

The media player unit itself is manufactured by a 
major supplier of entertainment systems, both stock units 
directly targeted for automobile manufacturers as well 
as branded systems sold via the aftermarket. Software 
running on the CPU handles audio parsing and playback 
requests, UI functions, and directly handles connections 
to the CAN bus. 

We found two vulnerabilities. First, we identified a 
latent update capability in the media player that will 
automatically recognize an ISO 9660-formatted CD with 
a particularly named file, present the user with a cryptic 
message and, if the user does not press the appropriate 
button, will then reflash the unit with the data contained 
therein.7 Second, knowing that the media player can
parse complex files, we examined the firmware for input 
vulnerabilities that would allow us to construct a file that, 
if played, gives us the ability to execute arbitrary code. 

For the latter, we reverse-engineered large parts of the 
media player firmware, identifying the file system code 
as well as the MP3 and WMA parsers. In doing so, we 
documented that one of the file read functions makes 
strong assumptions about input length and moreover that 
there is a path through the WMA parser (for handling 
an undocumented aspect of the file format) that allows 

7This is not the standard method that the manufacturer uses to
update the media player software and thus we believe this is likely a
vestigial capability in the supplier’s code base.
arbitrary length reads to be specified; together these allow
a buffer overflow. 

This particular vulnerability is not trivial to exploit. 
The buffer that is overflowed is not on the stack but 
in a BSS segment, without clear control data variables 
to hijack. Moreover, immediately after the buffer are 
several dynamic state variables whose values are con- 
tinually checked and crash the system when overwritten 
arbitrarily. 
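
The general shape of this bug class (a fixed-size buffer, a length field taken from the input, and live state stored immediately after the buffer) can be shown with a toy simulation. The buffer size, memory layout, and function names below are invented for illustration and bear no relation to the media player's actual firmware.

```python
BUF_SIZE = 16  # illustrative buffer size, not taken from the firmware


def make_memory() -> bytearray:
    """Simulated memory: a 16-byte buffer followed by 4 bytes of adjacent state."""
    mem = bytearray(BUF_SIZE + 4)
    mem[BUF_SIZE:] = b"\x01\x02\x03\x04"  # stand-in for neighboring state variables
    return mem


def read_unchecked(mem: bytearray, data: bytes, claimed_len: int) -> None:
    """Flawed pattern: trusts the length field supplied by the input."""
    for i in range(claimed_len):
        mem[i] = data[i]


def read_checked(mem: bytearray, data: bytes, claimed_len: int) -> None:
    """Same copy, but the length field is validated against the buffer size."""
    if claimed_len > BUF_SIZE or claimed_len > len(data):
        raise ValueError("length field exceeds buffer or input size")
    for i in range(claimed_len):
        mem[i] = data[i]


if __name__ == "__main__":
    mem = make_memory()
    read_unchecked(mem, b"A" * 20, 20)
    print("adjacent state after unchecked read:", bytes(mem[BUF_SIZE:]))
```

The unchecked copy silently replaces the four bytes of neighboring state, which mirrors why arbitrary overwrites of the intervening variables crash the real system unless their contents are chosen with care.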

To overcome these and other obstacles, we developed 
a native in-system debugger that communicates over an 
unused serial port we identified on the media player. This 
debugger lets us dump and alter memory, set breakpoints, 
and catch exceptions. Using this debugger we were 
able to find several nearby dynamic function pointers 
to overwrite as well as appropriate contents for the 
intervening state variables. 

We modified a WMA audio file such that, when burned
onto a CD, it plays perfectly on a PC but sends arbitrary
CAN packets of our choosing when played by our car’s 
media player. This functionality adds only a small space 
overhead to the WMA file. One can easily imagine many 
scenarios where such an audio file might find its way into 
a user’s media collection, such as being spread through 
peer-to-peer networks. 

OBD-II. The OBD-II port can access all CAN buses in
the vehicle. This is standard functionality because the 
OBD-II port is the principal means by which service 
technicians diagnose and update individual ECUs in a 
vehicle. This process is intermediated by hardware tools 
(sold both by automobile manufacturers and third parties) 
that plug into the OBD-II port and can then be used 
to upgrade ECUs’ firmware or to perform a myriad of 
diagnostic tasks such as checking the diagnostic trouble 
codes (DTCs). 

Since 2004, the Environmental Protection Agency 
has mandated that all new cars in the U.S. support the 
SAE J2534 “PassThru” standard—a Windows API 
that provides a standard, programmatic interface to 
communicate with a car’s internal buses. This is typically 
implemented as a Windows DLL that communicates 
over a wired or wireless network with the reprogram- 
ming/diagnostic tool (hereafter we refer to the latter 
simply as “the PassThru device”). The PassThru device 
itself plugs into the OBD-II port in the car and from that 
vantage point can communicate on the vehicle’s internal 
networks under the direction of software commands sent 
via the J2534 API. In this way, applications developed 
independently of the particular PassThru device can be 
used for reprogramming or diagnostics. 
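
The device independence described above can be sketched as an abstract interface with interchangeable back ends. The Python class and method names below are hypothetical and only loosely modeled on the style of the J2534 calls (the real API is a C-style Windows DLL interface with functions such as PassThruConnect and PassThruWriteMsgs); the loopback device and the single-byte request are simplifications for illustration.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class PassThruLike(ABC):
    """Hypothetical device-neutral interface in the spirit of J2534."""

    @abstractmethod
    def connect(self, protocol: str) -> int:
        """Open a channel on the vehicle bus and return a channel id."""

    @abstractmethod
    def write_msgs(self, channel: int, frames: List[bytes]) -> int:
        """Queue frames for transmission; return how many were accepted."""

    @abstractmethod
    def read_msgs(self, channel: int) -> List[bytes]:
        """Return and clear any frames received on the channel."""


class LoopbackDevice(PassThruLike):
    """Fake device for illustration: echoes written frames back."""

    def __init__(self) -> None:
        self._queues: Dict[int, List[bytes]] = {}

    def connect(self, protocol: str) -> int:
        channel = len(self._queues) + 1
        self._queues[channel] = []
        return channel

    def write_msgs(self, channel: int, frames: List[bytes]) -> int:
        self._queues[channel].extend(frames)
        return len(frames)

    def read_msgs(self, channel: int) -> List[bytes]:
        frames, self._queues[channel] = self._queues[channel], []
        return frames


def request_trouble_codes(device: PassThruLike) -> List[bytes]:
    """Application code that depends only on the abstract interface."""
    channel = device.connect("CAN")
    device.write_msgs(channel, [b"\x03"])  # simplified stand-in for a DTC request
    return device.read_msgs(channel)


if __name__ == "__main__":
    print(request_trouble_codes(LoopbackDevice()))
```

Because the application touches only the abstract interface, any vendor's device implementation could be substituted without changing the diagnostic code, which is the property the standard is meant to provide.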

We studied the most commonly used PassThru device 
for our car, manufactured by a well-known automotive 
electronics supplier on an OEM basis (the same device 
can be used for all current makes and models from 
the same automobile manufacturer). The device itself
is roughly the size of a paperback book and consists
of a popular SoC microprocessor running a variant of
Linux as well as multiple network interfaces, including
USB and WiFi, and a connector for plugging into
the car’s OBD-II port.8 We discovered two classes of
vulnerabilities with this device. First, we find that an 
attacker on the same WiFi network as the PassThru 
device can easily connect to it and, if the PassThru 
device is also connected to a car, obtain control over 
the car’s reprogramming. Second, we find it possible to 
compromise the PassThru device itself, implant malicious 
code, and thereby affect a far greater number of vehicles. 
To be clear, these are vulnerabilities in the PassThru 
device itself, not the Windows software which normally 
communicates with it. We experimentally evaluated both 
vulnerability classes and elaborate on our analyses below. 

After booting up, the device periodically advertises 
its presence by sending a UDP multicast packet on each 
network to which it is connected, communicating both 
its IP address and a TCP port for receiving client requests. 
Client applications using the PassThru DLL connect to 
the advertised port and can then configure the PassThru 
device or command it to begin communicating with the 
vehicle. Communication between the client application 
and the PassThru device is unauthenticated and thus 
depends exclusively on external network security for 
any access control. Indeed, in its recommended mode 
of deployment, any PassThru device should be directly 
accessible by any dealership computer. A limitation is 
that only a single application can communicate with a 
given PassThru device at a time, and thus the attacker 
must wait for the device to be connected but not in use. 

The PassThru device exports a proprietary, unauthen- 
ticated API for configuring its network state (e.g., for 
setting with which WiFi SSID it should associate). We 
identified input validation bugs in the implementation of 
this protocol that allow an attacker to run arbitrary Bourne 
Shell commands via shell-injection, thus compromising 
the unit. The underlying Linux distribution includes pro- 
grams such as telnetd, ftp, and nc so, having gained 
entry to the device via shell injection, it is trivial for the 
attacker to open access for inbound telnet connections 
(exacerbated by a poor choice of root password) and then 
transfer additional data or code as necessary. 
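
The class of input validation bug described here can be illustrated generically. In the sketch below, `set_wifi` is a made-up command name and the SSID pattern is an assumed policy; neither reflects the PassThru device's actual protocol or implementation. The first function shows the flawed pattern of pasting untrusted text into a shell command line, and the second shows one common remedy: validate the input and pass it as a separate argument rather than through a shell.

```python
import re
from typing import List

# Assumed policy for acceptable SSIDs; purely illustrative.
SSID_PATTERN = re.compile(r"[A-Za-z0-9_. -]{1,32}")


def build_command_unsafe(ssid: str) -> str:
    """Flawed pattern: untrusted text becomes part of a shell command line."""
    return "set_wifi --ssid " + ssid  # 'set_wifi' is a hypothetical tool


def build_command_safe(ssid: str) -> List[str]:
    """Validate the SSID and return an argument vector (no shell parsing)."""
    if not SSID_PATTERN.fullmatch(ssid):
        raise ValueError("rejected SSID")
    return ["set_wifi", "--ssid", ssid]


if __name__ == "__main__":
    hostile = "shop; telnetd"
    print("unsafe command line:", build_command_unsafe(hostile))
    try:
        build_command_safe(hostile)
    except ValueError as err:
        print("safe builder:", err)
```

With the unsafe builder, the text after the semicolon would be interpreted by a shell as a second command; the safe builder rejects the input before any command is formed.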

To evaluate the utility of this vulnerability and make 
it concrete, we built a program that combines all of these 
steps. It contacts any PassThru devices being advertised 
(e.g., via their WiFi connectivity or if connected directly 
via Ethernet), exploits them via shell injection, and 


8The manufacturer’s dealership guidelines recommend the use of 
the WiFi interface, thereby supporting an easier tetherless mode of use, 
and suggest the use of link-layer protection such as WEP (or, in the 
latest release of the device, WPA2) to prevent outside access. 
Figure 2: PassThru-based shell-injection exploit scenario.
The adversary gains access to the service center network 
(e.g., by compromising an employee laptop), then (1) 
compromises any PassThru devices on the network, each 
of which compromise any cars they are used to service 
(2 and 3), installing Trojan horses to be activated based 
on some environmental trigger. The PassThru device also 
(4) spreads virally to other PassThru devices (e.g., if a 
device is loaned to other shops) which can repeat the same 
process (5). 


installs a malicious binary (modifying startup scripts so 
it is always enabled). The malicious binary will send 
pre-programmed messages over the CAN bus whenever 
a technician connects the PassThru device to a car. These 
CAN packets install malware onto the car’s telematics 
unit. This malware waits for an environmental trigger 
(e.g., specific date and time) before performing some 
action. Figure 2 gives a pictorial overview of this attack. 

To summarize, an attacker who can connect to a 
dealership’s wireless network (e.g., via social engineering 
or a worm/virus à la Stuxnet [7]) is able to subvert any
active PassThru devices that will in turn compromise any 
vehicles to which they connect. Moreover, the PassThru 
device is sufficiently general to mount the attack itself. To 
demonstrate this, we have modified our program, turning 
it into a worm that actively seeks out and spreads to other 
PassThru devices in range. This attack does not require 
interactivity with the attacker and can be fully automated. 


4.3 Short-range wireless channels: Bluetooth 
We now turn to short-range wireless channels and focus 
on one in particular: Bluetooth. Like many modern cars, 
ours has built-in Bluetooth capabilities which allow the 
occupants’ cell phones to connect to the car (e.g., to 
enable hands-free calling). These Bluetooth capabilities 
are built into our car’s telematics unit. 

Through reverse engineering, we gained access to 
the telematics ECU’s Unix-like operating system and 
identified the particular program responsible for handling 
Bluetooth functionality. By analyzing the program’s 
symbols we established that it contains a copy of a 
popular embedded implementation of the Bluetooth 
protocol stack and a sample hands-free application. 
However, the interface to this program and the rest of the 
telematics system appear to be custom-built. It is in this 
custom interface code that we found evidence of likely
vulnerabilities. Specifically, we observed over 20 calls
to strcpy, none of which were clearly safe. We investigated
the first such instance in depth and discovered an
easily exploitable unchecked strcpy to the stack when
handling a Bluetooth configuration command.⁹ Thus, any
paired Bluetooth device can exploit this vulnerability to
execute arbitrary code on the telematics unit.

As with our indirect physical channel investigations, 
we establish the utility of this vulnerability by making it 
concrete. We explore two practical methods for exploiting 
this attack and in doing so unearth two sub-classes of the 
short-range wireless attack vector: indirect short-range 
wireless attacks and direct short-range wireless attacks. 
Indirect short-range wireless attacks. The vulnerabil- 
ity we identified requires the attacker to have a paired 
Bluetooth device. It may be challenging for an attacker to
pair her own device with the car’s Bluetooth system (a
challenge we consider in the direct short-range wireless
attacks discussion below). However, the car’s Bluetooth
subsystem was explicitly designed to support hands-free 
calling and thus may naturally be paired with one or 
more smartphones. We conjecture that if an attacker can 
independently compromise one of those smartphones, 
then the attacker can leverage the smartphone as a 
stepping-stone for compromising the car’s telematics 
unit, and thus all the critical ECUs on the car. 

To assess this attack vector we implemented a simple
Trojan Horse application on the HTC Dream (G1) phone
running Android 2.1. The application appears to be in- 
nocuous but under the hood monitors for new Bluetooth 
connections, checks to see if the other party is a telemat- 
ics unit (our unit identifies itself by the car manufacturer 
name), and if so sends our attack payload. While we 
have not attempted to upload our code to the Android 
Market, there is evidence that other Trojan applications 
have been successfully uploaded [25]. Additionally, there 
are known exploits that can compromise Android and 
iPhone devices that visit malicious Web sites. Thus our 
assessment suggests that smartphones can be a viable 
path for exploiting a car’s short-range wireless Bluetooth 
vulnerabilities. 
Direct short-range wireless attacks. We next assess 
whether an attacker can remotely exploit the Bluetooth 
vulnerability without access to a paired device. Our 
experimental analyses found that a determined attacker 
can do so, albeit in exchange for a significant effort in 
development time and an extended period of proximity 
to the vehicle. 

9 Because the size of the available buffer is small, our exploit simply
creates a new shell on the telematics unit from which it downloads and
executes more complex code from the Internet via the unit’s built-in
3G data capabilities.

There are two steps precipitating a successful attack.
First, the attacker must learn the car’s Bluetooth MAC
address. Second, the attacker must surreptitiously pair his
or her own device with the car. Experimentally, we find 
that we can use the open source Bluesniff [23] package 
and a USRP-based software radio to sniff our car’s Blue- 
tooth MAC address when the car is started in the presence 
of a previously paired device (e.g., when the driver turns 
on the car while carrying her cell phone). We were also 
able to discover the car’s Bluetooth MAC address by 
sniffing the Bluetooth traffic generated when one of the 
devices, which has previously been paired to a car, has its 
Bluetooth unit enabled, regardless of the presence of the 
car — all of the devices we experimented with scanned 
for paired devices upon Bluetooth initialization. 

Given the MAC address, the other requirement for 
pairing is possessing a shared secret (the PIN). Under 
normal use, if the driver wishes to pair a new device, she 
puts the car into pairing mode via a well-documented 
user interface, and, in turn, the car provides a random PIN 
(regenerated each time the car starts or when the driver 
initiates the normal pairing mode) which is then shown 
on the dashboard and must then be manually entered into 
the phone. However, we have discovered that our car’s 
Bluetooth unit will respond to pairing requests even with- 
out any user interaction. Using a simple laptop to issue 
pairing requests, we are thus able to brute force this PIN 
at a rate of eight to nine PINs per minute, for an average 
of approximately 10 hours per car; this rate is limited 
entirely by the response time of the vehicle’s Bluetooth 
stack. We conducted three empirical trials against our car 
(resetting the car each time to ensure that a new PIN was 
generated) and found that we could pair with the car after 
approximately 13.5, 12.5, and 0.25 hours, respectively. 
The pairing process does not require any driver inter- 
vention and will happen completely obliviously to any 
person in the car.¹⁰ While this attack is time consuming
and requires the car(s) under attack to be running, it is 
also parallelizable, e.g., an attacker could sniff the MAC 
addresses of all cars started in a parking garage at the 
end of a day (assuming the cars are pre-paired with at 
least one Bluetooth device). If a thousand such cars leave 
the parking garage in a day, then we expect to be able to 
brute force the PIN for at least one car within a minute. 
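
The figures above can be checked with a short calculation. The sketch below assumes a four-digit PIN (10,000 equally likely values), which the text does not state but which is consistent with the reported average of roughly 10 hours at eight to nine attempts per minute. It also assumes, hypothetically, that the attacker can attempt pairing against every sniffed car concurrently.

```python
PIN_SPACE = 10_000     # assumption: four-digit PIN, not stated in the text
RATE_PER_MIN = 8.5     # midpoint of the reported eight to nine PINs per minute


def expected_guesses(num_cars, pin_space=PIN_SPACE):
    """Expected guessing rounds until the first of num_cars PINs is found.

    For each car the correct PIN turns up on a round that is uniform on
    1..pin_space, so E[min] = sum over j of (j / pin_space) ** num_cars.
    """
    return sum((j / pin_space) ** num_cars for j in range(1, pin_space + 1))


def expected_minutes(num_cars):
    """Expected wall-clock minutes until the first successful pairing."""
    return expected_guesses(num_cars) / RATE_PER_MIN


single_car_hours = expected_minutes(1) / 60   # roughly 10 hours
garage_minutes = expected_minutes(1000)       # on the order of a minute
```

Under these assumptions the single-car average matches the reported figure, and with a thousand cars attacked in parallel the first success is expected after about a minute.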

After completing this pairing, the attacker can inject on 
the paired channel an exploit like the one we developed 
and thus compromise the vehicle. 


10 As an artifact of how this “blind” pairing works, the paired device
does not appear on the driver’s list of paired devices and cannot be
unpaired manually.


4.4 Long-range wireless channels: Cellular

Finally, we consider long-range wireless channels and,
in particular, focus on the cellular capabilities built into
our car’s telematics unit. Like many modern cars, our
car’s cellular capabilities facilitate a variety of safety
and convenience features (e.g., the car can automatically
call for help if it detects a crash). However, long-range
communications channels also offer an obvious target
for potential attackers, which we explore here. In this
section, we describe how these channels operate and how
we reverse engineered them, and we demonstrate that
a combination of software flaws allows a
completely remote compromise via the cellular voice
channel. We focus on adversarial actions that leverage
the existing cellular infrastructure, not ones that involve 
the use of adversarially-controlled infrastructure; e.g., we 
do not consider man-in-the-middle attacks. 

Telematics connectivity. For wide-area connectivity, 
our telematics unit is equipped with a cell phone interface 
(supporting voice, SMS and 3G data). While the unit 
uses its 3G data channel for a variety of Internet-based 
functions (e.g., navigation and location-based services), 
it relies on the voice channel for critical telematics 
functions (e.g., crash notification) because this medium 
can provide connectivity over the widest possible service 
area (i.e., including areas where 3G service is not 
yet available). To synthesize a digital channel in this 
environment, the manufacturer uses Airbiquity’s aqLink 
software modem to convert between analog waveforms and
digital bits. This use of the voice channel in general, and 
the aqLink software in particular, is common to virtually 
all popular North American telematics offerings today. 

In our vehicle, Airbiquity’s software is used to create a
reliable data connection between the car’s telematics unit
and a remote Telematics Call Center (TCC) operated by 
the manufacturer. In particular, the telematics unit incor- 
porates the aqLink code in its Gateway program which 
controls both voice and data cellular communication. 
Since a single cellular channel is used for both voice and 
data, a simple, in-band, tone-based signaling protocol is 
used to switch the call into data mode. The in-cabin audio 
is muted when data is transmitted, although a tell-tale 
light and audio announcement are used to indicate that a
call is in progress. For pure data calls (e.g., telemetry and 
remote diagnostics), the unit employs a so-called “stealth” 
mode which does not provide any indication that a call 
is in progress. 
Reverse engineering the aqLink protocol. Reverse 
engineering the aqLink protocol was among the most 
demanding parts of our effort, in particular because it 
demanded signal processing skills not part of the typical 
reverse engineering repertoire. For pedagogical reasons, 
we briefly highlight the process of our investigation. 

We first identified an in-band tone used to initiate “data 
mode.” Having switched to data mode, aqLink provides 
a proprietary modulation scheme for encoding bits. By 
calling our car’s telematics unit (the phone number is 
available via caller ID), initiating data mode with a tone 
generator and recording the audio signal that resulted,
we established that the center frequency was roughly
700 Hz and that the signal was consistent with a 400 bps 
frequency-shift keying (FSK) signal. 

We then used LD_PRELOAD on the telematics unit 
to interpose on the raw audio samples as they left the 
software modem. Using this improved signal source, we 
hunted for known values contained in the signal (e.g., 
unique identifiers stamped on the unit). We did so by en- 
coding these values as binary waveforms at hypothesized 
bitrates and cross-correlating them to the demodulated sig- 
nal until we were able to establish the correct parameters 
for demodulating digital bits from the raw analog signal. 
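
The search described above can be sketched in a few lines of Python. This is a simplified illustration, not the procedure actually used: it assumes the signal has already been reduced to +1/-1 baseband samples, and the sample rate, candidate bitrates, and bit pattern (standing in for a known identifier) are all hypothetical.

```python
def bits_to_waveform(bits, bitrate, sample_rate):
    """Render a bit sequence as +1.0/-1.0 samples at the given bitrate."""
    n_samples = int(len(bits) * sample_rate / bitrate)
    last = len(bits) - 1
    return [
        1.0 if bits[min(i * bitrate // sample_rate, last)] else -1.0
        for i in range(n_samples)
    ]


def best_correlation(signal, template):
    """Slide the template over the signal; return the best normalized score."""
    n = len(template)
    best = -1.0
    for offset in range(len(signal) - n + 1):
        window = signal[offset:offset + n]
        score = sum(s * t for s, t in zip(window, template)) / n
        best = max(best, score)
    return best


def estimate_bitrate(signal, known_bits, candidates, sample_rate):
    """Return (best candidate bitrate, per-candidate scores)."""
    scores = {
        rate: best_correlation(signal, bits_to_waveform(known_bits, rate, sample_rate))
        for rate in candidates
    }
    return max(scores, key=scores.get), scores
```

A score near 1.0 for exactly one candidate indicates that both the bitrate hypothesis and the alignment of the known value are correct; real recordings would additionally require handling noise and clock drift.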

From individual bits, we then focused on packet 
structure. We were lucky to discover a debugging flag in 
the telematics software that would produce a binary log 
of all packet payloads transmitted or received, providing 
ground truth. Comparing this with the bitstream data, 
we discovered the details of the framing protocol (e.g., 
the use of half-width bits in the synchronization header) 
and were able to infer that data is sent in packets of up 
to 1024 bytes, divided into 22-byte frames which are
divided into two 11-byte segments. We inferred that a 
CRC and ECC were both used to tolerate noise. Search- 
ing the disassembled code for known CRC constants 
quickly led us to determine the correct CRC to use, and 
the ECC code was identified in a similar fashion. For 
reverse-engineering the header contents, we interposed 
on the aqSend call (used to transmit messages), which 
allowed us to send arbitrary multi-frame packets and 
consequently infer the sequence number, multi-frame 
identifier, start of packet bit, ACK frame structure, etc. 
We omit the many other details due to space constraints. 
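
As an illustration of the constant-search step, the Python sketch below scans a firmware image for a few well-known CRC-32 constants in both byte orders. The text does not say which CRC the unit uses, so the constant list is an assumption chosen for illustration; 16-bit polynomials are omitted because two-byte patterns produce many false positives.

```python
import struct

# Well-known CRC-32 constants (assumed candidates, not taken from the text).
KNOWN_CONSTANTS = {
    0x04C11DB7: "CRC-32 polynomial (normal form)",
    0xEDB88320: "CRC-32 polynomial (reflected form)",
    0x77073096: "CRC-32 lookup table, entry 1 (reflected)",
}


def find_crc_constants(image):
    """Return sorted (offset, description, byte order) hits in a bytes image."""
    hits = []
    for value, name in KNOWN_CONSTANTS.items():
        for fmt, order in (("<I", "little-endian"), (">I", "big-endian")):
            needle = struct.pack(fmt, value)
            start = image.find(needle)
            while start != -1:
                hits.append((start, name, order))
                start = image.find(needle, start + 1)
    return sorted(hits)
```

Each hit is only a lead: the surrounding disassembly still has to be read to confirm that the constant is used in a checksum routine.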

Given our derived protocol specification, we then 
implemented an aqLink-compatible software modem in C 
using a laptop with an Intel ICH3-based modem exposed 
as an ALSA sound device under Linux. We verified the 
modulation and formatting of our packet stream using 
the debugging log described earlier. 

Finally, layered on top of the aqLink modem is the 
telematics unit’s own proprietary command protocol that 
allows the TCC to retrieve information about the state of 
the car as well as to remotely actuate car functions. Once 
the Gateway program decodes a frame and identifies it as 
a command message, the data is then passed (via an RPC- 
like protocol) to another telematics unit program which 
is responsible for supervising overall telematics activities 
and implementing the command protocol (henceforth, 
the Command program). We reverse-engineered enough 
of the Gateway and Command programs to identify a 
candidate vulnerability, which we describe below. 
Vulnerabilities in the Gateway. As mentioned earlier, 
the aqLink code explicitly supports packet sizes up to 
1024 bytes. However, the custom code that glues aqLink 
to the Command program assumes that packets will never
exceed 100 bytes or so (presumably since well-formatted
command messages are always smaller). This leads to 
another stack-based buffer overflow vulnerability that we 
verified is exploitable. Interestingly, because this attack 
takes place at the lowest level of the protocol stack, it com- 
pletely bypasses the higher-level authentication checks 
implemented by the Command program (since these 
checks themselves depend on being able to send packets). 
There is one key gap preventing this exploit from 
working in practice. Namely, the buffer overflow we 
chose to focus on requires sending over 300 bytes to 
the Gateway program. Since the aqLink protocol has 
a maximum effective throughput of about 21 bytes a 
second, in the best case, the attack requires about 14 
seconds to transmit. However, upon receiving a call, the 
Command program sends the caller an authentication 
request and, serendipitously, it requires a response within 
12 seconds or the connection is effectively terminated. 
Thus, we simply cannot send data fast enough over an 
unauthenticated link to overflow the vulnerable buffer. 
While we identified other candidate buffer overflows 
of slightly shorter length, we decided instead to focus on 
the authentication problem directly. 
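
The timing gap follows directly from the numbers quoted above, as this short calculation shows. The constants are taken from the text (treating "over 300 bytes" as exactly 300 and "about 21 bytes a second" as exactly 21); the helper name is hypothetical.

```python
PAYLOAD_BYTES = 300            # lower bound on the overflow payload size
THROUGHPUT_BYTES_PER_S = 21    # approximate maximum effective aqLink throughput
AUTH_TIMEOUT_S = 12            # window before an unauthenticated call is dropped

transmit_seconds = PAYLOAD_BYTES / THROUGHPUT_BYTES_PER_S
max_bytes_in_window = THROUGHPUT_BYTES_PER_S * AUTH_TIMEOUT_S


def fits_in_window(payload_bytes, timeout_s=AUTH_TIMEOUT_S):
    """True if the payload can be sent before the timeout expires."""
    return payload_bytes / THROUGHPUT_BYTES_PER_S <= timeout_s
```

At best about 252 bytes fit into the 12-second window, short of the 300 or more required, whereas a 60-second timeout leaves ample room.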
Vulnerabilities in authentication. When a call is placed 
to the car and data mode is initiated, the first command 
message sent by the vehicle is a random, three-byte
authentication challenge packet and the Command program
authentication timer is started. In normal operation, the 
TCC hashes the challenge along with a 64-bit pre-shared 
key to generate a response to the challenge. When 
waiting for an authentication response, the Command 
program will not “accept” any other packet (this does 
not prevent our buffer overflow, but does prevent sending 
other command messages). If an incorrect authentication 
response is received, or a response is not received within 
the prescribed time limit, the Command program will 
send an error packet. When this packet is acknowledged, 
the unit hangs up (and it is not possible to send any 
additional data until the error packet is acknowledged). 
After several failed attempts to derive the shared 
key, we examined code that generates authentication 
challenges and evaluates responses. Both contained errors 
that together were sufficient to construct a vulnerability. 
First, we noted that the “random” challenge implemen- 
tation is flawed. In most situations, this nonce is static and 
identical on the two cars we tested. The key flaw is that 
the random number generator is re-initialized whenever 
the telematics unit starts — such as when a call comes 
in after the car has been off — and it is seeded each time 
with the same constant. Therefore, multiple calls to a car 
while it is off result in the same expected response. Con- 
sequently, an attacker able to observe a response packet 
(e.g., via sniffing the cellular link during a TCC-initiated 
call) will be able to replay that response in the future. 
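
The effect of seeding with a constant can be reproduced with a toy model. The Python sketch below is illustrative only: the seed value, the generator algorithm, and the class name are assumptions, and only the three-byte challenge length comes from the text.

```python
import random
import secrets

FIXED_SEED = 0x1234  # hypothetical constant; the real seed is not given


class TelematicsUnitModel:
    """Toy model: the generator is re-seeded with a constant at every start-up."""

    def __init__(self):
        self._rng = random.Random(FIXED_SEED)

    def challenge(self):
        # Three-byte challenge, as described in the text.
        return bytes(self._rng.getrandbits(8) for _ in range(3))


def fresh_challenge():
    """One possible remedy: draw each challenge from the OS entropy source."""
    return secrets.token_bytes(3)


first_call = TelematicsUnitModel().challenge()   # unit wakes up for a call
second_call = TelematicsUnitModel().challenge()  # unit restarted before next call
```

In this model `first_call` and `second_call` are always identical, so a response recorded once remains valid for every later call that wakes the unit.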
The code parsing authentication responses has an even
more egregious bug that permits circumvention without
observing a correct response. In particular, there is a flaw
such that for certain challenges (roughly one out of every
256), carefully formatted but incorrect responses will be
interpreted as valid. If the random number generator is
not re-initialized (e.g., if the car is on when repeatedly
called) then the challenge will change each time and 1
out of 256 trials will have the desired structure. Thus,
after an average of 128 calls the authentication test can be 
bypassed, and we are able to transmit the exploit (again, 
without any indication to the driver). This attack is more 
challenging to accomplish when the car is turned off 
because the telematics unit can shut down when a call 
ends (hence re-initializing the random number generator) 
before a second call can reach it. 

To summarize, we identified several vulnerabilities in 
how our telematics unit uses the aqLink code that, to- 
gether, allow a remote exploit. Specifically, there is a 
discrepancy between the set of packet sizes supported by 
the aqLink software and the buffer allocated by the telem- 
atics client code. However, to exploit this vulnerability 
requires first authenticating in order to set the call timeout 
value long enough to deliver a sufficiently long payload. 
This is possible due to a logic flaw in the unit’s authenti- 
cation system that allows an attacker to blindly satisfy the 
authentication challenge after approximately 128 calls. 
Concrete realization. We demonstrate and evaluate 
our attack in two concrete forms. First, we implemented 
an end-to-end attack in which a laptop running our 
custom aqLink-compatible software modem calls our 
car repeatedly until it authenticates, changes the timeout 
from 12 seconds to 60 seconds, and then re-calls our 
car and exploits the buffer overflow vulnerability we 
uncovered. The exploit then forces the telematics unit 
to download and execute additional payload code from 
the Internet using the IP-addressable 3G data capability. 

We also found that the entire attack can be implemented
in a completely blind fashion, without any
capacity to listen to the car’s responses. Demonstrating
this, we encoded an audio file with the modulated 
post-authentication exploit payload and loaded that file 
onto an iPod. By manually dialing our car on an office 
phone and then playing this “song” into the phone’s 
microphone, we are able to achieve the same results and 
compromise the car. 


5 Remote Exploit Control 

Thus far we have described the external attack surface 
of an automobile and demonstrated the presence of 
vulnerabilities in a range of different external channels. 
An adversary could use such means to compromise 
a vehicle’s systems and install code that takes action 
immediately (e.g., unlocking doors) or in response to
some environmental trigger (e.g., the time of day, speed,
or location as exported via the onboard GPS). 

However, the presence of wireless channels in the 
modern vehicle qualitatively changes the range of options 
available to the adversary, allowing actions to be remotely 
triggered on demand, synchronized across multiple 
vehicles, or interactively controlled. Further, two-way 
channels permit both remote monitoring and data exfil- 
tration. In this section, we broadly evaluate the potential 
for such post-compromise control, characterize these 
capabilities, and evaluate the capabilities via prototype 
implementations for TPMS, Bluetooth, FM RDS and Cel- 
lular channels. Our prototype attack code is delivered by 
exploiting one of the previously described vulnerabilities 
(indeed, any exploit would work). Table 2 summarizes 
these results, again with our assessment of the effort 
required to discover and implement the capability. 
TPMS. We constructed two versions of a TPMS-based 
triggering channel. One installs code on another ECU 
(the telematics ECU in our case, although any ECU would 
do) that monitors tire pressure signals as the TPMS ECU 
broadcasts them over the CAN bus. The presence of a 
particular tire pressure reading then triggers the payload; 
the trigger tire pressure value is not expected to be found 
in the wild but must instead be adversarially transmitted 
over the air. For our second example, the attack reflashes 
the TPMS ECU via CAN and installs code onto it that will 
detect specific wireless trigger packets and, if detected, 
will send pre-programmed CAN packets directly over the 
car’s internal network. Both attacks required a custom 
TPMS packet generator (described below). The latter 
attack also required significant reverse engineering efforts 
(e.g., we had to write a custom IDA Pro module for
disassembling the firmware, and we were highly memory
constrained, so that the resulting attack firmware,
hand-written object code, needed to re-use code space
originally allocated for CRC verification, the removal of
which did not impair the normal TPMS functionality).

To experimentally verify these triggers, we reverse- 
engineered the 315 MHz TPMS modulation and framing 
protocol (far simpler than the aqLink modem) and then im- 
plemented a USRP software radio module that generates 
the appropriate wireless signals to activate the triggers. 
Bluetooth. We modified the Bluetooth exploit code on 
the telematics ECU to pair, post compromise, with a 
special MAC address used by the adversary and accept 
her commands (either triggering existing functionality 
or receiving new functionality). We did not explore 
exfiltrating data via the two-way Bluetooth channel, but 
we see no reason why it would not be possible. 

FM RDS. Using the CD-based firmware update attack 
we developed earlier, we reflashed the media player ECU 
to send a pre-determined set of CAN packets (our payload)
when a particular “Program Service Name” message
arrives over the FM RDS channel. We experimentally
verified this with a low-power FM transmitter driven by 
a Pira32 RDS encoder; an attacker could communicate 
over much longer ranges using higher power. Table 2 lists 
the cost for this attack as medium given the complexity 
of programming/debugging in the media player execution 
environment (we bricked numerous CD players before 
finalizing our implementation and testing on our car). 
Cellular. We modified our telematics exploit payload 
to download and run a small (400 lines of C code) IRC 
client post-compromise. The IRC client uses the vehicle’s 
high bandwidth 3G data channel to connect to an IRC 
server of our choosing, self-identifies, and then listens 
for commands. Subsequently, any commands sent to this 
IRC server (from any Internet connected host) are in turn 
transmitted to the vehicle, parsed by the IRC client, and 
then transmitted as CAN packets over the appropriate 
bus. We further provided functionality to use this channel 
in both a broadcast mode (where all vehicles subscribed 
to the channel respond to the commands) or selectively 
(where commands are only accepted by the particular 
vehicle specified in the command). For the former, we 
experimentally verified this by compromising two cars 
(located over 1,000 miles apart), having them both join 
the IRC channel, and then both simultaneously respond 
to a single command (for safety, the command we sent 
simply made the audio systems on both cars chime). 
Finally, the high-bandwidth nature (up to 1 Mbps at 
times) of this channel makes it easy to exfiltrate data. (No 
special software is needed since ftp is provided on the 
host platform.) To make this concrete we modified our 
attack code for two demonstrations: one that periodically 
“tweets” the GPS location of our vehicle and another that 
records cabin audio conversations and sends the recorded 
data to our servers over the Internet. 


6 Threat Assessment 
Thus far we have considered threats primarily at a 
technical level. Previously, we have shown that gaining 
access to a car’s internal network provides sufficient 
means for compromising all of its systems (including 
lights, brakes, and engine) [14]. In this paper, we have 
further demonstrated that an adversary has a practical 
opportunity to effect this compromise (i.e., via a range 
of external communications channels) without having 
physical access to the vehicle. However, real threats 
ultimately have some motive as well: a more concrete 
goal that is achieved by exploiting the capability to attack. 
This leaves unanswered the crucial question: Just how 
serious are the threats? Obviously, there are no clear 
ways to predict such things, especially in the absence 
of any known attacks in the wild. However, we can 
reason about how the capabilities we have identified 
can be combined in service to known goals. While one
can easily envision hypothetical “cyber war” or terrorist
scenarios (e.g., infect large numbers of cars en masse
via war dialing or a popular audio file and then, later,
trigger them to simultaneously disengage the brakes
when driving at high speed), our lack of experience with
such concerns means such threats are highly speculative.

Channel                 Range   Implemented Control / Trigger                 Exfiltration  Cost
----------------------  ------  --------------------------------------------  ------------  ----------
TPMS (tire pressure)    Short   Predefined tire pressure sequences cause      No            Low-Medium
                                telematics unit to send CAN packets
TPMS (tire pressure)    Short   TPMS trigger causes TPMS receiver to send     No            Medium
                                CAN packets
Bluetooth               Short   Presence of trigger MAC addresses allows      Yes*          Low
                                remote control
FM radio (RDS channel)  Long    FM RDS trigger causes radio to send CAN       No            Medium
                                packets
Cellular                Global  IRC command-and-control (botnet) channel      Yes           Low
                                allows broadcast and single-vehicle control

Table 2: Implemented control and trigger channels. The Cost column captures the approximate effort to develop this
post-compromise control capability. The Exfiltration column indicates whether this channel can also be used to exfiltrate
data. For (*), we did not experimentally verify data exfiltration over Bluetooth.

Instead, to gauge whether these threats create practical
risks, we consider (briefly) how the raw capabilities we
have identified might affect two scenarios closer to our 
experience: financially motivated theft and third-party 
surveillance. 
Theft. Using any of our implemented exploit capabilities 
(CD, PassThru, Bluetooth, and cellular), it is simple 
to command a car to unlock its doors on demand, thus 
enabling theft. However, a more visionary car thief 
might realize that blind, remote compromise can be used 
to change both scale and, ultimately, business model. 
For example, instead of attacking a particular target 
car, the thief might instead try to compromise as many 
cars as possible (e.g., by war dialing). As part of this 
compromise, he might command each car to contact a 
central server and report back its GPS coordinates and 
Vehicle Identification Number (VIN). The IRC network 
described in Section 5 provides just this capability. The 
VIN in turn encodes the year, make and model of each car 
and hence its value. Putting these capabilities together, 
a car thief could “sift” through the set of cars, identify 
the valuable ones, find their location (and perhaps how 
long they have been parked) and, upon visiting a target 
of interest then issue commands to unlock the doors and 
so on. An enterprising thief might stop stealing cars 
himself, and instead sell his capabilities as a “service”
to other thieves (“I’m looking for late model BMWs or 
Audis within a half mile of 4th and Broadway. Do you 
have anything for me?”) Careful readers may notice 
that this progression mirrors the evolution of desktop 
computer compromises: from individual attacks, to mass 
exploitation via worms and viruses, to third-party markets 
selling compromised hosts as a service. 

While the scenario itself is today hypothetical, we have 
evaluated a complete attack whereby a thief remotely 
disables a car’s security measures, allowing an unskilled
accomplice to enter the car and drive it away. Our attack
directs the car’s compromised telematics unit to unlock
the doors, start the engine, disengage the shift lock
solenoid (which normally prevents the car from shifting
out of park without the key present), and spoof packets
used in the car’s startup protocol (thereby bypassing the
existing immobilizer anti-theft measures¹¹). We have
implemented this attack on our car. In our experiments 
the accomplice only drove the “stolen” car forward and 
backward because we did not want to break the steering 
column lock, though numerous online videos demonstrate 
how to do so using a screwdriver. (Other vehicles have 
the steering column lock under computer control.) 
Surveillance. We have found that an attacker who has 
compromised our car’s telematics unit can record data 
from the in-cabin microphone (normally reserved for 
hands-free calling) and exfiltrate that data over the con- 
nected IRC channel. Moreover, as said before, it is easy 
to capture the location of the car at all times and hence 
track where the driver goes. These capabilities, which 
we have experimentally evaluated, could prove useful 
to private investigators, corporate spies, paparazzi, and 
others seeking to eavesdrop on the private conversations 
within particular vehicles. Moreover, if the target vehicle 
is not known, the mass compromise techniques described 
in the theft scenario can also be brought to bear on this 
problem. For example, someone wishing to eavesdrop 
on Google executives might filter a set of compromised 
cars down to those that are both expensive and located in 
the Google parking lot at 10 a.m. The location of those 
same cars at 7 p.m. is likely to be the driver’s residence, 
allowing the attacker to identify the driver (e.g., via com- 
mercial credit records). We suspect that one could identify 
promising targets for eavesdropping quite quickly in this 
manner. 


11 Past work on bypassing immobilizers required prior direct or indirect
access to the car’s keys, e.g., Bono et al. [2] and Francillon
et al. [9].


7 Discussion and Synthesis

Our research provides us with new insights into the
risks with modern automotive computing systems. We
begin here with a discussion of concrete directions for
increasing security. We then turn to our now broadly
informed reflections on why vulnerabilities exist today
and the challenges in mitigating them.


7.1 Implementation fixes 

Our concrete, near-term recommendations fall into two 
familiar categories: restrict access and improve code ro- 
bustness. Given the high interconnectedness of car ECUs 
necessary for desired functionality, the solution is not to 
simply remove or harden individual components (e.g., the 
telematics unit) or create physically isolated subnetworks. 

We were surprised at the extent to which the car’s 
externally facing interfaces were open to unsolicited 
communications — thereby broadening the attack surface 
significantly. Indeed, very simple actions, such as not 
allowing Bluetooth pairing attempts without the driver’s 
first manually placing the vehicle in pairing mode, would 
have undermined our ability to exploit the vulnerability 
in the underlying Bluetooth code. Similarly, we believe 
the cellular interface could be significantly hardened by 
using inbound calls only to “wake up” the car (i.e., never 
for data transfer) and having the car itself periodically 
dial out for requests while it is active. Finally, use of 
application-level authentication and encryption (e.g., 
via OpenSSL) in the PassThru device’s proprietary 
configuration protocol would have prevented its code 
from being exploited as well. 

However, rather than assume the attack surface will 
not be breached, the underlying code platform should 
be hardened as well. Such hardening includes standard security 
engineering best practices, such as not using unsafe 
functions like strcpy, diligent input validation, and 
checking function “contracts” at module boundaries. As 
an additional measure of protection against less-motivated 
adversaries, we recommend removing all debugging 
symbols and error strings from deployed ECU code. 

We also encourage the use of simple anti-exploitation 
mitigations such as stack cookies and ASLR that can 
be easily implemented even for simple processors and 
can significantly increase the exploit burden for poten- 
tial attackers. In the same vein, critical communications 
channels (e.g., Bluetooth and telematics) should have 
some amount of behavioral monitoring. The car should 
not allow arbitrary numbers of connection failures to go 
unanswered nor should outbound Internet connections to 
arbitrary destinations be allowed. In cases where ECUs 
communicate on multiple buses, they should only be al- 
lowed to be reflashed from the bus with the smallest ex- 
ternal attack surface. This does not stop all attacks where 
one compromised ECU affects an ECU on a bus with 
a smaller attack surface, but it does make such attacks 
more difficult. Finally, a number of the exploits we de- 
veloped were also facilitated by the services included in 
several units. For example, we made extensive use of 
telnetd, ftp, and vi, which were installed on the 
PassThru and telematics devices. There is no reason for 
these extraneous binaries to exist in shipping ECUs, and 
they should be removed before deployment, as they make 
it easier to exploit additional connectivity to the plat- 
form. 
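
As one illustration of the behavioral monitoring recommended above, the following Python sketch refuses further connection attempts once too many failures occur within a time window. The class name FailureMonitor, the thresholds, and the choice of Python are assumptions made purely for illustration; an ECU would implement an equivalent check in its own firmware.

```python
from collections import deque


class FailureMonitor:
    """Hypothetical limiter: block new attempts after repeated failures."""

    def __init__(self, max_failures: int = 5, window_s: float = 60.0):
        self.max_failures = max_failures
        self.window_s = window_s
        self._failures = deque()  # timestamps of recent failures

    def _expire(self, now: float) -> None:
        # Forget failures that fall outside the sliding time window.
        while self._failures and now - self._failures[0] > self.window_s:
            self._failures.popleft()

    def record_failure(self, now: float) -> None:
        self._failures.append(now)
        self._expire(now)

    def allow_attempt(self, now: float) -> bool:
        # Allow an attempt only while recent failures stay under the limit.
        self._expire(now)
        return len(self._failures) < self.max_failures


monitor = FailureMonitor(max_failures=3, window_s=10.0)
for t in (0.0, 1.0, 2.0):
    monitor.record_failure(t)
blocked = not monitor.allow_attempt(3.0)   # three failures in the window
recovered = monitor.allow_attempt(20.0)    # window has elapsed
```

A sliding window is used here so that a legitimate user is not locked out permanently; a real deployment might also log or report the event.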

Finally, secure (authenticated and reliable) software 
updates must also be considered as part of automotive 
component design. 


7.2 Vulnerability drivers 

While the recommendations in Section 7.1 can signifi- 
cantly increase the security of modern cars against exter- 
nal attacks and post-compromise control, none of these 
ideas are new or innovative. Thus, perhaps the more inter- 
esting question is why they have not been applied in the 
automotive environment already. Our findings and subse- 
quent interactions with the automotive industry have given 
us a unique vantage point for answering this question. 

One clear reason is that automobiles have not yet been 
subjected to significant adversarial pressures. Tradition- 
ally automobiles have not been network-connected and 
thus manufacturers have not had to anticipate the actions 
of an external adversary; anyone who could get close 
enough to a car to modify its systems was also close 
enough to do significant damage through physical means. 
Our automotive systems now have broad connectivity; 
millions of cars on the road today can be directly 
addressed via cellular phones and via the Internet. 

This is similar to the evolution of desktop personal 
computer security during the early 1990s. In the same 
way that connecting PCs to the Internet exposed extant 
vulnerabilities that previously could not conveniently be 
exploited, so too does increasing the connectivity of auto- 
motive systems. This analogy suggests that, even though 
automotive attacks do not take place today, there is cause 
to take their potential seriously. Indeed, much of our work 
is motivated by a desire that the automotive manufacturers 
should not repeat the mistakes of the PC industry — wait- 
ing for high profile attacks before making security a top 
priority [18, 19]. We believe many of the lessons learned 
in hardening desktop systems (such as those suggested ear- 
lier) can be quickly re-purposed for the embedded context. 

However, our experimental vulnerability analyses also 
uncover an ecosystem for which high levels of assurance 
may be fundamentally challenging. Reflecting upon 
our discovered vulnerabilities, we noticed interesting 
similarities in where they occur. In particular, virtually 
all vulnerabilities emerged at the interface boundaries 
between code written by distinct organizations. 

Consider for example the Airbiquity software modem, 
which appears to have been delivered as a completed 
component. We found vulnerabilities not in the software 
modem itself but rather in the “glue” code calling it 
and binding it to other telematics functions. It was here 
that the caller did not appear to fully understand the 
assumptions made by the component being called. 

We find this pattern repeatedly. The Bluetooth vulner- 
ability arose from a similar misunderstanding between 
the callers of the Bluetooth protocol stack library and its 
implementers (again in glue code). The PassThru vulnera- 
bility arose in script-based glue code that tried to interface 
a proprietary configuration protocol with standard Linux 
configuration scripts. Even the media player firmware 
update vulnerability appears to have arisen because the 
manufacturer was unaware of the vestigial CD-based 
reflashing capability implemented in the code base. 

While interface boundary problems are common in 
all kinds of software, we believe there are structural rea- 
sons that make them particularly likely in the automo- 
tive industry. In particular, the automotive industry has 
adopted an outsourcing approach to software that is quite 
similar to that used for mechanical components: supply 
a specification and contract for completed parts. Thus, 
for many components the manufacturer does not do the 
software development and is only responsible for integra- 
tion. We have found, for example, that different model 
years of ECUs with effectively the same functionality 
used completely different source code bases because they 
were provided by different suppliers. Indeed, we have 
come to understand that frequently manufacturers do not 
have access to the source code for the ECUs they con- 
tract for (and suppliers are hesitant to provide such code 
since this represents their key intellectual property ad- 
vantage over the manufacturer). Thus, while each sup- 
plier does unit testing (according to the specification) 
it is difficult for the manufacturer to evaluate security 
vulnerabilities that emerge at the integration stage. Tra- 
ditional kinds of automated analysis and code reviews 
cannot be applied and assumptions not embodied in the 
specifications are difficult to unravel. Therefore, while 
this outsourcing process might have been appropriate for 
purely mechanical systems, it is no longer appropriate for 
digital systems that have the potential for remote compromise. 

Developing security solutions compatible with the 
automotive ecosystem is challenging and we believe 
it will require more engagement between the computer 
security community and automotive manufacturers (in 
the same way that our community engages directly with 
the makers of PC software today). 


8 Conclusions 

A modern automobile is controlled by tens of distinct 
computers physically interconnected with each other via 
internal (wired) buses and thus exposed to one another. 
A non-trivial number of these components are also exter- 
nally accessible via a variety of I/O interfaces. Previous 
research showed that an adversary can seriously impact 
the safety of a vehicle if he or she is capable of sending 
packets on the car’s internal wired network [14], and 
numerous other papers have discussed potential security 
risks with future (wired and wireless) automobiles in 
the abstract or on the bench [10, 15, 24, 26, 27, 28]. 
To the best of our knowledge, however, we are the 
first to experimentally and systematically study the 
externally-facing attack surface of a car. 

Our experimental analyses focus on a representative, 
moderately priced sedan. We iteratively refined an auto- 
motive threat model framework and implemented com- 
plete, end-to-end attacks along key points of this frame- 
work. For example, we can compromise the car’s ra- 
dio and upload custom firmware via a doctored CD, we 
can compromise the technicians’ PassThru devices and 
thereby compromise any car subsequently connected to 
the PassThru device, and we can call our car’s cellular 
phone number to obtain full control over the car’s telem- 
atics unit over an arbitrary distance. Being able to com- 
promise a car’s ECU is, however, only half the story: The 
remaining concern is what an attacker is able to do with 
those capabilities. In fact, we show that a car’s externally- 
facing I/O interfaces can be used post-compromise to 
remotely trigger or control arbitrary vehicular functions 
at a distance and to exfiltrate data such as vehicle lo- 
cation and cabin audio. Finally, we consider concrete, 
financially-motivated scenarios under which an attacker 
might leverage the capabilities we develop in this pa- 
per. 

Our experimental results give us the unique oppor- 
tunity to reflect on the security and privacy risks with 
modern automobiles. We synthesize concrete, pragmatic 
recommendations for future automotive security, as well 
as identify fundamental challenges. We disclosed our 
results to relevant industry and government stakeholders. 
While defending against known vulnerabilities does not 
imply the non-existence of other vulnerabilities, many 
of the specific vulnerabilities identified in this paper have 
been or will soon be addressed. 


Abstract 


We present DECODE, a system for recovering information 
from phones with unknown storage formats, a critical 
problem for forensic triage. Because phones have myr- 
iad custom hardware and software, we examine only the 
stored data. Via flexible descriptions of typical data struc- 
tures, and using a classic dynamic programming algo- 
rithm, we are able to identify call logs and address book 
entries in phones across varied models and manufactur- 
ers. We designed DECODE by examining the formats of 
one set of phone models, and we evaluate its performance 
on other models. Overall, we are able to obtain high 
performance for these unexamined models: an average 
recall of 97% and precision of 80% for call logs; and 
average recall of 93% and precision of 52% for address 
books. Moreover, at the expense of recall dropping to 
14%, we can increase precision of address book recovery 
to 94% by culling results that don’t match between call 
logs and address book entries on the same phone. 


1 Introduction 


When criminal investigators search a location and seize 
computers and other artifacts, a race begins to locate off- 
site evidence. Not long after a search warrant is executed, 
accomplices will erase evidence; logs at cellular providers, 
ISPs, and web servers will be rotated out of existence; and 
leads will be lost. Moreover, investigators make the most 
progress during on-scene interviews of suspects if they are 
able to ask about on-scene evidence. Mobile phones are of 
particular interest to investigators. Address book entries 
and call logs contain valuable information that can be used 
to construct a timeline, compile a list of accomplices, or 
demonstrate intent. Further, phone numbers can provide 
a link to a geographical location via billing records. For 
crimes involving drug trafficking, child exploitation, and 
homicide, these leads are critical [17]. 

The process of quickly acquiring important evidence 
on-scene in a limited but accurate fashion is called forensic 
triage [16]. Unfortunately, digital forensics is a time-consuming 
task, and once computers are seized and sent 
off site, examination results are returned after a months- 
long work queue. Getting partial results on-scene ensures 
certain leads and evidence are recovered sooner. 

Forensic triage is harder for phones than desktop com- 
puters. While the Windows/Intel platform vastly domi- 
nates desktops, the mobile phone market is based on more 
than ten operating systems and more than ten platform 
manufacturers making use of an unending introduction 
of custom hardware. In 2010, 1.6 billion new phones 
were sold [15], with billions of used phones still in use. 
Smart phones, representing only 20% of new phones [15], 
store information from thousands of applications each 
with potentially custom data formats. The more popu- 
lar feature phones, while simpler devices, are quick to 
be released and replaced by new models with different 
storage formats. Both types of phones are problematic as 
phone application, OS, and file system specifications are 
closely guarded as commercial secrets. Companies do not 
typically release information required for correct parsing. 

Assuming the phone is not locked by the user, the 
easiest method of phone triage is to simply flip through the 
phone’s interface for interesting information. This time- 
consuming process can destroy the integrity of evidence, 
as there is no guarantee data will not be modified during 
the browse. Similarly, backups of the phone may be 
examined, but neither backups nor manual browsing will 
recover deleted data and data otherwise hidden by the 
phone’s interface. Hidden data can include metadata, such 
as timestamps and flags, that can demonstrate a timeline 
and user intent, both of which can be critical for the legal 
process. 

Forensic investigation begins with data acquisition and 
the parsing of raw data into information. The challenge 
of phones and embedded systems is that too often the 
exact data format used on the device has never been seen 
before. Hence, a manual process of reverse engineering 
begins, a dead end for practitioners. Recent research on 
automated reverse engineering is largely focused on the instrumentation 
of the system and executables [1, 6]. While 
accurate and reasonable for the common Windows/Intel 
desktop platform, construction of a new instrumentation 
system for every phone architecture-OS combination in 
use would require significant time for each and an exper- 
tise not present in the practitioner community. 


In this paper, we focus on a data-driven approach to 
phone triage. We seek to quickly parse data from the 
phone without analyzing or instrumenting software. We 
aim to obtain high quality results, even for phones that 
have not been previously encountered by our system. Our 
solution, called DECODE, leverages success from already 
examined phones in the form of a flexible library of prob- 
abilistic finite state machines. Our main insight is that the 
variety of phone models and data formats can be lever- 
aged for recovering information from new phones. We 
make three primary contributions: 


• We propose a method of block hash filtering for revealing 
the most interesting blocks within a large 
store on a phone. We compare small blocks of un- 
parsed data from a target phone to a library of known 
hashes. Collisions represent blocks that contain con- 
tent common to the other phones, and therefore not 
artifacts specific to the user, e.g., phone numbers 
or call log entries. Our methods work in seconds, 
reducing acquired data by 69% on average, without 
removing usable information. 


• To recover information from the remaining data, we 
adapt techniques from natural language processing. 
We propose an efficient and flexible use of probabilis- 
tic finite state machines (PFSMs) to encode typical 
data structures. We use the created PFSMs along 
with a classic dynamic programming algorithm to 
find the maximum likelihood parse of the phone’s 
memory. 


• We provide an extensive empirical evaluation of our 
system and its ability to perform well on a large 
variety of previously unexamined phone models. We 
apply our PFSM set — unmodified — to six other 
phone models from Nokia, Motorola, and Samsung 
and show that our methods are able to recover call 
logs with 97% recall and 80% precision and address 
books with 93% recall and 52% precision for this set 
of unseen models. 


There are a series of commercial products that parse 
data from phones (e.g., .XRY, Cellebrite, and Paraben). 
However, these products rely on slow, manual reverse 
engineering for each phone model. Moreover, none of 
these products will attempt to parse data for previously 
unseen phone models. Even the collection of all such 
products does not cover all phone models currently on the 
market, and certainly not the set of all models still in use. 


In contrast, we design and evaluate a general approach 
for automatically recovering information on previously 
unseen devices, one that leverages information from past 
success. 


2 Methodology and Assumptions 


Our goal is to enable triage-based data recovery for mobile 
phones during criminal investigations. Below, we provide 
a definition of triage, our problem, and our assumptions. 
Unlike much related work, our focus is not on incident 
response, malware analysis, privilege escalation, protocol 
analysis, or other topics related to security primitives. We 
aim to have an impact on any crime where a phone may 
be carried by the perpetrator before the crime, held during 
the crime, used as part of the crime or to record the crime 
(e.g., a trophy photo), or used after the crime. 


The triage process. The process of quickly acquiring im- 
portant evidence on-scene in a limited but accurate fash- 
ion is called forensic triage [16]. Our goals are focused 
on the law enforcement triage process, which begins with 
a search warrant issued upon probable cause, or one of the 
many lawful exceptions [12] to the Fourth Amendment 
(e.g., incident to arrest). Law enforcement has several 
objectives when executing a search and performing triage. 
The first is locating all devices related to the crime so that 
no evidence is missed. The second is identifying devices 
that are not relevant to the crime so that they can be ig- 
nored, as every crime lab has a months-long backlog for 
completing forensic analysis. That delay is only exacer- 
bated by adding unneeded work. The third is interviewing 
suspects at the crime scene. These interviews are most 
effective when evidence found on-scene is presented to 
the interviewed person. Similarly, quickly determining 
leads for further investigation is critical so that evidence 
or persons do not disappear. Central to all of these ob- 
jectives is the ability to rapidly examine and extract vital 
information from a variety of devices, including mobile 
phones. 

Phone triage is not a replacement for gathering infor- 
mation directly from carriers; however, it can take several 
weeks to obtain information from a carrier. Moreover, 
carriers store only limited information about each phone. 
While most keep call logs for a year, other information is 
ephemeral. Text message content is kept for only about 
a week by Verizon and Sprint, and the IP address of a 
phone is kept for just a few days by AT&T [3]. In contrast, 
the same information is often kept by the phone indefi- 
nitely and, if deleted, it is still possibly recoverable using 
a forensic examination. 

The less time it takes to complete a triage of each de- 
vice, the more impact our techniques will have. While 
some crime scenes involve only a few devices, increasingly 
crime scenes involve tens and potentially hundreds 
of devices. For example, an office can be the center of op- 
erations for a gang, organized crime unit, or para-military 
cell. Typically little time is available and, in the case 
of search warrants, restrictions are often in place on the 
duration of time that a location can be occupied by law en- 
forcement. In military scenarios, operations may involve 
deciding which, if any, of several persons and devices 
in a location should be brought back using the limited 
space in a vehicle; forensic triage is a common method of 
deciding. 


Problem definition. Our goal is to enable investigators 
to extract information quickly (e.g., in 20 minutes or less) 
from a phone, regardless of whether that exact phone 
model has been encountered before. We limit our results 
to information that is common to phones — address books 
and call logs — but is stored differently by each phone. 
Triage is not a replacement for a secondary, in-depth 
examination; but it does achieve shortened delay with a 
minimal reduction in recall and precision. Recall is the 
fraction of all records of interest that are recovered from a 
device; precision is the fraction of recovered records that 
are correctly parsed. 
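
The two metrics can be made concrete with a short Python sketch. It assumes, for illustration only, that records are hashable values and that a recovered record counts as correctly parsed when it exactly equals a ground-truth record; the function name and the example records are hypothetical.

```python
def recall_precision(true_records, recovered_records):
    """Return (recall, precision) for a set of recovered records.

    Assumption: a recovered record is "correct" if it exactly matches
    a ground-truth record of interest.
    """
    truth = set(true_records)
    recovered = set(recovered_records)
    correct = truth & recovered
    recall = len(correct) / len(truth) if truth else 0.0
    precision = len(correct) / len(recovered) if recovered else 0.0
    return recall, precision


# Four records exist on the device; six were recovered, three correctly.
truth = {"call-1", "call-2", "call-3", "call-4"}
recovered = {"call-1", "call-2", "call-3", "junk-1", "junk-2", "junk-3"}
recall, precision = recall_precision(truth, recovered)
```

In this example three of four true records are found (recall 0.75) and three of six recovered records are correct (precision 0.5).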


Data acquisition. We make the following assumptions 
in the context of on-site extraction of information from 
embedded devices. The technical process of extracting 
a memory dump from a phone starts off very differently 
compared to laptops and desktops. Data on a phone is 
typically stored in custom solid state memory. These 
chips are typically soldered onto a custom motherboard, 
and data extraction without burning out the chip requires 
knowledge of pinouts. For that reason, several other meth- 
ods are in common use for extracting data. Broadly, data 
can be extracted representing either the logical or physical 
layout of memory. Often these representations are 
referred to as the logical or physical image of a device, 
respectively. 

A logical image is typically easier to obtain and parse; 
however, it suffers from some serious limitations. First, it 
only contains information that is presented by the file sys- 
tem or other application interfaces. It omits deleted data, 
metadata about content, and the physical layout of data in 
memory (which we use in our parsing). Second, logical- 
extraction interfaces typically enforce access rules (e.g., 
preventing access to a locked phone) and may modify data 
or metadata upon access. Examples of logical extraction 
include using phone backup software or directly browsing 
through a phone using its graphical user interface. Due to 
the above deficiencies, our techniques operate directly on 
the physical image. 

A physical image contains the full layout of data stored 
in a phone’s memory, including deleted data that has not 
yet been overwritten; however, parsing raw data presents 
a significant challenge to investigators, one our techniques 
attempt to address. We discuss the parsing challenges 
further in Section 3.2. 

Physical extraction requires an interface that is below 
the phone’s OS or applications. There are a few different 
ways of acquiring a physical image. For example, some 
phones are compatible with flasher boxes [11], while oth- 
ers allow for extraction via a JTAG interface, or physical 
removal of the chip. Physical extraction typically takes 
between a few minutes and an hour depending on the 
extraction method, size of storage, and bus bandwidth. 
When we evaluate our techniques, we assume the prior 
ability to acquire the physical image of the phone. 

Numerous companies sell commercial products that 
acquire data from phones, both logically and physically. 
This acquisition process is easier than the recovery of 
information from raw data, though still a challenge and 
not one we address. Of course, we do not expect our 
methods to be used on phones for which the format of data 
is already known. But no company offers a product that 
addresses even a large portion of the phone market and 
no combination of products covers all possible phones, 
even among the market of phones still being sold. Used 
phones in place in the US and around the world number at 
least an order of magnitude larger than phones still being 
manufactured. 


Limitations of our threat model. We assume the owner 
of the phone has left data in a plaintext, custom format 
that is typical of how computers store information. We 
allow for encryption and even simple obfuscation, but we 
do not propose techniques that would defeat either. While 
this threat model is weak, it is representative of phone 
users involved in traditional crimes. Some smart-phones 
encrypt data, most do not; and almost all feature phones 
do not, and they represent 80% of the market [15]. Further, 
it is not possible for one attacker to encrypt the data of 
every other phone in existence, and our techniques work 
on all phones for which plaintext can be recovered. In 
other words, while we allow for any one person to encrypt 
their data, it does not significantly limit the impact of our 
results. 


3 Design of DECODE 


In this section, we provide a high-level overview of 
DECODE including its input, primary components, and 
output. 

DECODE takes the physical image of a mobile phone 
as input. We can think of the physical image as a stream 
of bytes with an unknown structure and no explicit de- 
limiters. DECODE filters and analyzes this byte stream 
to extract important information, presenting the output to 
the investigator. The internal process it uses is composed 
of two components, illustrated in Fig. 1: (i) block hash 
filtering and (ii) inference. 

Figure 1: An illustration of DECODE’s process. Data acquired 
from a phone is passed first through a filtering mechanism 
based on hash sets of other phones. The remaining data is input 
to a multistep inference component, largely based on a set of 
PFSMs. The output is a set of records representing information 
found on the phone. The PFSMs can optionally be updated to 
improve the process. 

DECODE uses the block hash filter to exclude sub- 
sequences of bytes that do not contain information of 
interest to investigators. The primary purpose of this fil- 
tering is to reduce the amount of data that needs to be 
examined and therefore increase the speed of the system. 

DECODE parses the filtered byte stream to extract in- 
formation first in the form of fields and then as records. 
Fields are the basic unit of information and they in- 
clude data types such as phone numbers and timestamps. 
Records are groups of semantically related fields that con- 
tain evidence of interest to investigators, e.g., address 
book entries. The inference component is designed to 
be both extensible and flexible, allowing an investigator 
to iteratively refine rules and improve results when time 
allows. 


3.1 Block Hash Filtering 


DECODE’s block hash filtering component (BHF) is 
based on the notion that long identical byte sequences 
found on different phones are unnecessary for triage. That 
is, such sequences are unlikely to contain useful informa- 
tion for investigators. Mobile phones use a portion of 
their physical memory to store operating system software 
and other data that have limited utility for triage. BHF 
is designed to remove this cruft and reduce the number 
of bytes that needs to be analyzed, thereby increasing the 
speed of the system. 


Description. DECODE’s block hash filter logically divides 
the input byte stream into small subsequences of 
bytes. We refer to each of these subsequences as a block. 
DECODE filters out a block if its hash value matches a 
value in a library of hashes computed from other phones. 
Blocks may repeat within the same phone, but only the 
first occurrence of each block remains after filtering. 
DECODE uses block hashes, rather than a direct byte comparison, 
to improve system performance. However, BHF 
may lead to erroneous filtering due to block collisions. 
One type of collision arises when blocks with different 
byte sequences share the same hash value. Another type 
of collision occurs when blocks share the same subsequence 
even though they actually contain user information. 
Currently, DECODE mitigates the risk of collisions 
by using a cryptographic hash function and a sufficiently 
large block size. 

Figure 2: Block hash filtering takes a stream of n bytes and 
creates a series of overlapping blocks of length b. The start of 
each block differs by d < b bytes. Any block whose hash collides 
with that of a block on another phone (or the same phone) is 
filtered out. 

To make the filter more resilient to small perturbations 
in byte/block alignment, DECODE uses a sliding window 
technique with overlap between the bytes of consecutive 
blocks [22]. In other words, the last bytes of a block are 
the same as the first bytes of the next block. 

More formally, DECODE logically divides an input 
stream of n bytes, into blocks of b bytes with a shift 
of d < b bytes between the start of successive blocks. 
The SHA-1 hash value for each block is computed and 
compared to the hash library. DECODE filters out all 
matched blocks. Fig. 2 illustrates a simple example. 
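
The formal description above can be sketched in Python as follows. This is a minimal illustration, not DECODE's implementation: the function names are hypothetical, and the sketch assumes that a byte is dropped whenever any block covering it matches the library, a detail the text leaves open. SHA-1 is computed with the standard hashlib module.

```python
import hashlib


def block_hashes(data: bytes, b: int, d: int):
    """Yield (offset, SHA-1 digest) for blocks of b bytes, shifted by d < b."""
    if not 0 < d < b:
        raise ValueError("require 0 < d < b")
    for start in range(0, len(data) - b + 1, d):
        yield start, hashlib.sha1(data[start:start + b]).digest()


def build_library(images, b: int, d: int) -> set:
    """Collect the block hashes of previously examined phone images."""
    return {digest for image in images for _, digest in block_hashes(image, b, d)}


def block_hash_filter(target: bytes, library: set, b: int, d: int):
    """Return the (offset, bytes) runs of target that survive filtering."""
    drop = bytearray(len(target))  # 1 marks a byte covered by a matched block
    seen = set(library)
    for start, digest in block_hashes(target, b, d):
        if digest in seen:
            drop[start:start + b] = b"\x01" * b
        else:
            seen.add(digest)  # later repeats on the same phone are filtered
    runs, run_start = [], None
    for i, flag in enumerate(drop):
        if not flag and run_start is None:
            run_start = i
        elif flag and run_start is not None:
            runs.append((run_start, target[run_start:i]))
            run_start = None
    if run_start is not None:
        runs.append((run_start, target[run_start:]))
    return runs


# Toy example: 64 bytes shared with another phone, then user-specific data.
common = bytes(range(64))
user = b"Bob 1-972-642-8666 3:26P"
library = build_library([common], b=16, d=8)
remaining = block_hash_filter(common + user, library, b=16, d=8)
```

With these toy values only the user-specific tail of the image remains. The block size and shift used here are arbitrary and far smaller than would be sensible for real images.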

As we show empirically in Section 5, nearly all of the 
benefit of block hash filtering can be realized by just using 
another phone of the same make and model. This result 
ensures BHF is scalable as the test phone need not be 
compared to all phones in an investigator’s library. 

The general idea of our block hash filter is similar 
to work by a variety of researchers in a number of domains 
[9, 13, 22]. Our primary contribution is the empirical 
analysis of the technique in the phone domain. Further 
discussion of related work is given in Section 6. 

0042006F0062000B0B01000300000B1972642866600008130207D603070F1A17 
[Field labels in the figure: Unicode, 11-digit phone number, Timestamp] 

Figure 3: A simplified example of raw data as stored by a Nokia 
model phone, labeled with the correct interpretation. DECODE 
outputs a call log: the Unicode string “Bob”; the phone number 
(0xB digits long and null terminated) 1-972-642-8666; and the 
timestamp 3/7/2006 3:26:23 PM. 


3.2 Inference 


After block hash filtering has been performed, what re- 
mains is a reduced ad hoc data source about which we 
have only minimal information. Our goal is to identify cer- 
tain types of structured information, such as phone num- 
bers, names, and other data types embedded in streams of 
this data. 

Parsing phones is particularly challenging due to the 
inherent ambiguity of the input byte stream. Along with 
the lack of explicit delimiters, there is significant overlap 
between the encodings for different data structures. For 
example, certain sequences of bytes could be interpreted 
as both a valid phone number and a valid timestamp. For 
these reasons, simple techniques such as the Unix strings 
command and regular expressions are mostly ineffective. 

DECODE solves this ambiguity by using standard prob- 
abilistic parsing tools and a probabilistic model of encod- 
ings that might be seen in the data. DECODE obtains the 
maximum likelihood parse of the input stream creating 
a hierarchical description of information on the phone 
in the form of fields and records. More concretely, the 
output of DECODE is a set of call log and address book 
records. Each record is comprised of fields representing 
phone numbers, timestamps, strings, and other structures 
extracted from the raw stream. 


3.2.1 Fields and Records 


Within the block filtered data source, we have no infor- 
mation about where records or fields begin or end, and 
we have no explicit delimiters. Fig 3 shows simplified 
example data that could encode an address book entry 
in a Nokia phone; DECODE would receive this snippet 
embedded and undelineated in megabytes of other data. 
Unlike large objects, such as JPEGs or Word documents, such 
small artifacts are difficult to isolate and can easily appear 
randomly. 
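
To make the Fig. 3 example concrete, the following sketch decodes the 32 bytes of that figure into the three values named in its caption. The byte offsets are our own reading of the figure, chosen by hand so that the output matches the caption; they are assumptions for illustration and not a specification of the Nokia format, and decode_example is a hypothetical helper.

```python
import datetime

# The 32 bytes shown in Fig. 3, written as hex.
RAW = bytes.fromhex(
    "0042006F0062000B0B01000300000B1972642866600008130207D603070F1A17")


def decode_example(raw: bytes):
    """Decode the Fig. 3 snippet using hand-picked (assumed) offsets."""
    # Bytes 0-5: three big-endian UTF-16 code units.
    name = raw[0:6].decode("utf-16-be")
    # Byte 14: digit count (0x0B); digits follow, two per byte (nibbles).
    n_digits = raw[14]
    packed = raw[15:15 + (n_digits + 1) // 2]
    digits = "".join(f"{byte:02x}" for byte in packed)[:n_digits]
    # Bytes 25-31: 2-byte year, then month, day, hour, minute, second.
    year = int.from_bytes(raw[25:27], "big")
    month, day, hour, minute, second = raw[27:32]
    stamp = datetime.datetime(year, month, day, hour, minute, second)
    return name, digits, stamp


name, digits, stamp = decode_example(RAW)
```

The difficulty DECODE faces is that none of these offsets are known in advance, and the same bytes admit other interpretations.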

To infer information found on phones, DECODE uses 
standard methods for probabilistic finite state machines 
(PFSMs), which we describe here. As implied above, 
we have a lower level of field state machines that encode 
raw bytes as phone numbers, timestamps, and other types. 
We also have a higher level of record state machines that 
encode fields as call log entries and address book entries. 
For example, a call log record can be flexibly encoded 
as a phone number field and timestamp field very near to 
one another; the encoding might also include an optional 
text field. 

Each field’s PFSM consists of one or more states, in- 
cluding a set of start states and a set of end states. Each 
state has a given probability of transitioning to another 
state in the machine. Each state emits a single byte dur- 
ing each state transition of the PFSM. The emitted byte 
is governed by a probability distribution over the bytes 
from 0x00 to 0xFF. Restricting the set of bytes that can 
be output by a state is achieved by setting the probability 
of those outputs to zero. For example, an ASCII alpha- 
betic state would only assign non-zero probabilities to 
the ASCII codes for “a” through “z” and “A” through 
“Z”. Every PFSM in DECODE’s set is targeted towards 
a specific data type. If correctly defined, a field’s PFSM 
will only accept a sequence of bytes if that sequence is a 
valid encoding of the field type. We constructed the field 
PFSMs based on past observations (see Section 4.1). 
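
As an illustration of how an emission distribution restricts a state, the sketch below builds the ASCII alphabetic state described above and scores a byte sequence against it. The uniform distribution over letters and the self-transition probability passed as stay are assumptions made for the example; the actual probabilities used in DECODE are not given here.

```python
import string
from typing import List


def ascii_alpha_emission() -> List[float]:
    """Emission distribution over bytes 0x00-0xFF for an ASCII alphabetic
    state: uniform over a-z and A-Z (an assumption), zero elsewhere."""
    letters = string.ascii_letters.encode("ascii")
    probs = [0.0] * 256
    for byte in letters:
        probs[byte] = 1.0 / len(letters)
    return probs


def sequence_probability(data: bytes, emission: List[float],
                         stay: float) -> float:
    """Probability that one self-looping state emits data, then exits."""
    if not data:
        return 0.0
    p = 1.0
    for byte in data:
        p *= emission[byte]
    # len(data) - 1 self-transitions, followed by one exit transition.
    return p * stay ** (len(data) - 1) * (1.0 - stay)


emission = ascii_alpha_emission()
p_text = sequence_probability(b"Bob", emission, stay=0.9)
p_mixed = sequence_probability(b"B0b", emission, stay=0.9)  # digit: rejected
```

A sequence containing any byte with zero emission probability receives probability zero, which is how a field PFSM rejects invalid encodings.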

Examples of DECODE’s specific field types include 
10-digit phone numbers, 7-digit phone numbers, Unicode 
strings, and ASCII strings. Each specific field is associ- 
ated with a generic field type such as text or phone number. 
Some fields have fixed lengths and others have arbitrary 
lengths. 

We define records in a similar manner. Records are 
represented as PFSMs, except that each state emits a 
generic field rather than a raw byte. 

Given the set of PFSMs representing each field type 
that we have encoded, we then aggregate them all into a 
single Field PFSM. We separately aggregate all record PF- 
SMs into a single Record PFSM. The aggregation naively 
creates transitions from every field’s end state to every 
other field’s start states with some probability, and we do 
the same for compiling records. (We discuss setting these 
probabilities below.) In the end, we have two distinct 
PFSMs that are used as input to our system, along with 
data from a phone. 


3.2.2 Finding the maximum likelihood sequence of 
states 


Our basic challenge is that, for a given phone byte stream 
that is passed to the inference component of DECODE, 
there will be many possible ways to parse the data. That is, 
there are many ways the PFSMs could have created the 
observed data, but some of these are more likely than others 
given the state transitions and the output probabilities. To 
formalize the problem, let $B = b_0, b_1, \ldots, b_n$ be the 
stream of n bytes from the data source. Let 
$S = s_0, s_1, \ldots, s_n$ be a sequence of states which could 
have generated the output bytes. Our goal, then, is to find 

\[
\arg\max_{s_0, \ldots, s_n} P(s_0, s_1, \ldots, s_n \mid b_0, b_1, \ldots, b_n),
\tag{1}
\]

i.e., the maximum probability sequence of states given 
the observed bytes. These states are chosen from the set 
encoded in the PFSM given to DECODE. The probabilities 
assigned to the PFSM’s states, transitions, and emissions 
affect the specific value that satisfies the above equation. 

In a typical hidden Markov model, one assumes that 
an output byte is a function only of the current unknown 
state, and that given this state, the current output is 
independent of all other states and outputs. Using this 
assumption, and noting that multiplying the above expression 
by $P(b_0, \ldots, b_n)$ does not change the state sequence 
which maximizes the expression, we can write 

\[
\begin{aligned}
\arg\max_{s_0, \ldots, s_n} P(s_0, \ldots, s_n \mid b_0, \ldots, b_n)
&= \arg\max_{s_0, \ldots, s_n} P(s_0, \ldots, s_n \mid b_0, \ldots, b_n)\, P(b_0, \ldots, b_n) \\
&= \arg\max_{s_0, \ldots, s_n} P(s_0, \ldots, s_n, b_0, \ldots, b_n) \\
&= \arg\max_{s_0, \ldots, s_n} P(s_0, \ldots, s_n)\, P(b_0, \ldots, b_n \mid s_0, \ldots, s_n) \\
&= \arg\max_{s_0, \ldots, s_n} P(s_0, \ldots, s_n) \prod_{i=0}^{n} P(b_i \mid s_i).
\end{aligned}
\tag{2}
\]


Naively enumerating all possible state sequences and 
selecting the best parse is at best inefficient and at worst 
intractable. One way around this is to assume that the 
current state depends only on the state that came 
immediately before it, and is independent of other states 
further in the past. This is known as a first order Markov 
model, and allows us to write 

\[
\begin{aligned}
\arg\max_{s_0, \ldots, s_n} P(s_0, \ldots, s_n) \prod_{i=0}^{n} P(b_i \mid s_i)
&= \arg\max_{s_0, \ldots, s_n} P(s_0)\, P(s_1 \mid s_0)\, P(s_2 \mid s_0, s_1) \cdots P(s_n \mid s_0, s_1, \ldots, s_{n-1}) \prod_{i=0}^{n} P(b_i \mid s_i) \\
&= \arg\max_{s_0, \ldots, s_n} P(s_0) \prod_{i=1}^{n} P(s_i \mid s_{i-1}) \prod_{i=0}^{n} P(b_i \mid s_i).
\end{aligned}
\tag{3}
\]


The Viterbi algorithm is an efficient algorithm for finding 
the state sequence that maximizes the above expression. 
The complexity of the Viterbi algorithm is $O(nk^2)$, where 
n and k are the number of bytes and states, respectively. 
For a full explanation of the algorithm, see for example the 
texts by Viterbi [24] or Russell and Norvig [21]. 
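
For readers who want to see the recursion, here is a minimal textbook Viterbi in Python that maximizes the first order expression above. It is a generic sketch, not DECODE's implementation: it works in log space to avoid underflow, assumes a non-empty observation sequence, and takes the model as plain lists (start, trans, emit), which are naming choices made for this example.

```python
import math
from typing import List, Sequence


def viterbi(obs: Sequence[int], start: List[float],
            trans: List[List[float]], emit: List[List[float]]) -> List[int]:
    """Most likely state sequence for a non-empty obs.

    start[s]    = P(s_0 = s)
    trans[r][s] = P(s_i = s | s_{i-1} = r)
    emit[s][o]  = P(b_i = o | s_i = s)
    """
    def log(p: float) -> float:
        return math.log(p) if p > 0 else -math.inf

    k = len(start)
    score = [log(start[s]) + log(emit[s][obs[0]]) for s in range(k)]
    back = []  # back[i][s] = best predecessor of state s at step i + 1
    for o in obs[1:]:
        prev, score, ptr = score, [], []
        for s in range(k):
            best = max(range(k), key=lambda r: prev[r] + log(trans[r][s]))
            ptr.append(best)
            score.append(prev[best] + log(trans[best][s]) + log(emit[s][o]))
        back.append(ptr)
    # Trace back from the best final state.
    state = max(range(k), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]


# Toy field model (hypothetical): state 0 emits ASCII digits, state 1 letters.
digit = [0.1 if 0x30 <= v <= 0x39 else 0.0 for v in range(256)]
alpha = [1 / 52 if chr(v).isascii() and chr(v).isalpha() else 0.0
         for v in range(256)]
path = viterbi(b"12ab", [0.5, 0.5], [[0.8, 0.2], [0.2, 0.8]], [digit, alpha])
```

The two nested loops over states inside the loop over observations give the $O(nk^2)$ running time quoted above.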
3.2.3 Fixed length fields and records 


Markov models are well-suited to data streams with arbi- 
trary length fields. For example, an arbitrary length text 
string can be modeled well by a single state that might 
transition to itself with probability a, or with probability 
1 - a to some other state, thereby terminating the 
string. Unfortunately, first order Markov models are not 
well-suited to modeling fields with fixed lengths (like 7- 
digit phone numbers), since it is impossible to enforce the 
transition to a new state after 7 bytes when one is only con- 
ditioning state transition on a single past state. In other 
words, a first order Markov model cannot “remember” 
how long it has been in a particular state. 
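
The limitation can be made explicit with a short calculation (our own illustration, using the self-transition probability a from the text). Once a single self-looping state is entered, the length L of the run it emits is geometrically distributed, so every length has non-zero probability and no choice of a can force exactly seven bytes:

```latex
P(L = \ell) = a^{\ell - 1}(1 - a), \qquad \ell = 1, 2, \ldots,
\qquad \mathbb{E}[L] = \frac{1}{1 - a}.

% Best case for a 7-byte field: maximize a^{6}(1 - a) over a.
\frac{d}{da}\, a^{6}(1 - a) = 6a^{5} - 7a^{6} = 0
\;\Longrightarrow\; a = \tfrac{6}{7},
\qquad
P(L = 7) = \left(\tfrac{6}{7}\right)^{6} \cdot \tfrac{1}{7} \approx 0.057.
```

Even with the most favorable a, less than 6% of the probability mass falls on the intended length.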
Since it is critical for us to model certain fixed length 
fields like dates and phone numbers, we had two options: 

• Add a separate new state for every position in a fixed 
length field. For example, a 7-digit phone number 
would have seven separate states, rather 
than a single state. 

• Implement an mth order Markov model, where m is 
equal to the length of the longest fixed length field 
we wish to model. 

The first option, under a naive implementation, leads to a 
very large number of states, and since Viterbi is $O(nk^2)$, 
it leads to impractical run times. 

The second option, using an mth order Markov model, 
keeps the number of states low, but can also lead to very 
large run times of $O(nk^{m+1})$. However, by taking 
advantage of the fact that most state transitions in our model 
only depend upon a single previous state, and other struc- 
ture in our problem, we are able to implement Viterbi, 
even for our fixed length fields, in time that is close to 
the implementation for a first order Markov model with a 
small number of states. Similar techniques have been used 
in the language modeling literature to develop efficient 
higher-order Markov models [14]. 


3.2.4 Hierarchical Viterbi 


DECODE uses Viterbi twice. First, it passes the filtered 
byte stream to Viterbi with the Field PFSM as input. The 
output of the first pass is the most likely sequence of 
generic fields associated with the byte stream. That field 
sequence is then input to Viterbi along with the Record 
PFSM for a second pass. We refer to these two phases as 
field-level and record-level inference, respectively. 

The hierarchical composition of records from fields 
(which are in turn composed of bytes) can be captured 
by a variety of statistical models, including context free 
grammars. The main reason we chose to run Viterbi in 
this hierarchical fashion, rather than integrating the infor- 
mation about a phone type in something like a context free 
grammar, was to limit the explosion of states. In 
particular, because we have a variety of fixed-length field types, 
such as phone numbers, the number of states required 
to implement a seamless hierarchical model would grow 
impractically large. Our resulting inference algorithms 
would not have practical run times. 

The decomposition of our inference into a field-level 
stage and a record-level stage makes the computations 
practical at a minimal loss in modeling power. The reason 
that DECODE can operate on phones that are unseen is 
that record state machines are very general. For example, 
we don’t require timestamps and phone numbers to appear 
in any specific order for a call log entry. We require only 
that they are both present. 


3.2.5 Post-processing 


The last stage of our inference process takes the set of 
records recovered by Viterbi and passes them through a 
decision tree classifier to remove potential false positives. 
We refer to this step as post-processing. We use a decision 
tree classifier because it is able to take into account features 
that can be inefficient to encode in Viterbi. For example, 
our classifier considers whether a record was found in 
isolation in the byte stream, or in close proximity to other 
records. In the former case, the record is more likely to be 
a false positive. Our evaluation results (Section 5) show 
that this process results in significant improvements to 
precision with a negligible effect on recall. 

We use the Weka J48 Decision Tree, an open source 
implementation of a well-known classifier 
(http://www.cs.waikato.ac.nz/ml/weka). In general, a 
decision tree can be used to decide whether or not an input 
is an example of the target class for which it is trained. 
The classifier is trained using a set of feature tuples rep- 
resenting both positive and negative examples. In our 
case, the decision tree decides whether a given record, 
output from our Viterbi stage, is valid or not. We selected 
a set of features common to both call log and address 
book records: number of days from the average date; fre- 
quency of phone numbers with same area code; number 
of different forms seen for the same number (e.g., 7-digit 
and 10-digit); number of characters in string; number of 
times the record appears in memory; distance to closest 
neighbor record. We do not claim that our choice of fea- 
tures and classifier is optimal; it merely represents a lower 
bound for what is possible. 

Post-processing does not inhibit the investigator; it is 
a filter intended to make the investigator’s work easier. 
To this end, DECODE can make both the pre- and 
post-processing results available, ensuring that the 
investigator has as much useful information as possible. 

For our evaluation, the positive training examples con- 
sisted of true records from a small set of phones called 
our development set (described in detail in Section 5). 
Generic type  | Specific type                                               | Num. states 
Records 
Call logs     | Nokia call log (composed of text, phone num., timestamps)   | 8 
Call logs     | General call log (composed of text, phone num., timestamps) | 5 
Address books | General address book (composed of phone numbers, text)      | 3 
Fields 
Phone number  | ASCII                                                       | 11 
Phone number  | Unicode                                                     | 22 
Phone number  | Nokia 10 digit                                              | 6 
Timestamp     | UNIX                                                        | 4 
Timestamp     | Samsung                                                     | 4 
Timestamp     | Nokia                                                       | 7 
Text          | ASCII bigram                                                | 6 
Text          | Unicode                                                     | 7 
Number index  | Nokia number index                                          | 1 
unstructured  | unstructured                                                | 1 

Table 1: Examples of types that we have defined in DECODE. 


To create the negative training examples, we used a 10 
megabyte stream of random data with byte values selected 
uniformly at random from 0x00 to 0xFF. We input the 
random data to DECODE’s Viterbi implementation and used 
the resulting output records as negative examples. We 
found that this provided better results than using negative 
examples found on real phones. 


4 Implementation of State Machines 


In the previous section, we presented DECODE’s design 
broadly; in this section, we focus on the core of the in- 
ference process: the probabilistic finite state machines 
(PFSM). 

DECODE’s PFSMs support a number of generic field 
types such as phone number, call log type, timestamp, 
and text as well as the target record types: address book 
and call log. Table 1 shows some example field types that 
we have defined and the number of states for each. In 
all, DECODE uses approximately 40 field-level and 10 
record-level PFSMs. 

Most fields emit fixed-length byte sequences. For 
example, the 10-digit phone number field is defined as 
10 states in which state k (for k ≠ 1) can only be 
reached by state k - 1. The state machine for a 
10-digit phone number as found on many Nokia phones is: 

[State machine diagram: Length → Digits 1,2 → Digits 3,4 → Digits 5,6 → Digits 7,8 → Digits 9,10, with transition probability 1 between consecutive states.] 


As mentioned in the previous section, each state emits 
a single byte; since Nokia often stores digits as nibbles, 
each state in the machine encodes two digits. The emis- 
sion probability is governed by both the semantics of the 
Nokia encoding and real-world constraints. For example, 
a 10-digit phone number (in the USA) cannot start with a 
0 or a 1 and therefore the first state in the machine cannot 
emit bytes 0x00-0x1F, i.e., the emission probability for 
each of these bytes is zero. 
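
The emission constraints for the nibble-encoded number can be sketched as a simple validity check. This is a simplification for illustration: it treats emission as allowed or forbidden rather than as a full probability distribution, it omits the leading length state, and it assumes (as in Fig. 3) that the high nibble of each byte holds the earlier digit. The function names are hypothetical.

```python
from typing import Optional


def emission_allowed(position: int, byte: int) -> bool:
    """May digit-pair state number `position` (0-4) emit this byte?"""
    hi, lo = byte >> 4, byte & 0x0F
    if hi > 9 or lo > 9:
        return False  # each nibble must be a decimal digit
    if position == 0 and hi in (0, 1):
        return False  # first digit cannot be 0 or 1: bytes 0x00-0x1F
    return True


def parse_10_digit(data: bytes) -> Optional[str]:
    """Return the 10-digit number packed in 5 bytes, or None if invalid."""
    if len(data) != 5:
        return None
    if not all(emission_allowed(i, b) for i, b in enumerate(data)):
        return None
    return "".join(f"{b:02x}" for b in data)


number = parse_10_digit(bytes.fromhex("9726428666"))
rejected = parse_10_digit(bytes.fromhex("1972642866"))  # starts with 1
```

In the PFSM, the same rule is expressed by giving the forbidden bytes an emission probability of zero in the corresponding state.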

Some fields, such as an unstructured byte stream, 
have arbitrary length. Such a field is simply defined 
by a single state with probability a of transitioning 
to itself, and probability 1 - a of terminating. 
In fact, this specific field is special: DECODE 
uses the unstructured field as a “catch-all” for unknown 
or unstructured portions of the byte stream. Byte 
sequences that do not match a more meaningful field 
type will always match the unstructured field, which is: 

[State machine diagram: a single unstructured state with a self-transition of probability a.] 





We emphasize that our goal is not to produce a full spec- 
ification of the format of a device. While we would cer- 
tainly be delighted if this were an easy problem to solve, 
we note that we can extract significant amounts of useful 
information from a data source even when large parts of 
the format specification are not understood. Hence, rather 
than solving the problem of complete format specification, 
we seek to extract as many records as possible according 
to our specification of records. It is also important to 
note that our field and record definitions may ignore large 
amounts of structure in a phone format. Only a minimal 
amount of information about a phone’s data organization 
is needed to define useful fields and records. We return 
to this point in Section 5.3. 


4.1 Coding State Machines 


We created most of the PFSMs used in DECODE using 
a hex editor and manual reverse engineering on a small 
subset of phones that we denote as our development set. 
We limited the development set to one phone model each 
from four manufacturers with multiple instances of each 
model: the Nokia 3200B, Motorola v551, LG G4015 
and Samsung SGH-T309. We intentionally did not ex- 
amine any other phone models from these manufacturers 
prior to the evaluation of DECODE (Section 5) so that we 
could evaluate the effectiveness of our state machines on 
previously unobserved phone models. 

We also used DECODE itself to help refine and create 
new state machines, both field and record level, for the de- 
velopment phones. This process was very similar to how 
we imagine an investigator would use DECODE during 
the post-triage examination. 

Once we reached high recall for the development 
set, we fixed the PFSMs and other components, using 
DECODE without modification for the duration of our 
evaluation regardless of which model was parsed. 


Selecting Transition Probabilities. A sequence of bytes 
may match multiple different field types. Similarly, a se- 
quence of fields may match multiple record types. Viterbi 
accounts for this by choosing the most likely type. It may 
appear that a large disadvantage of this approach is that 
we must manually set the type probabilities for both fields 
and records. However, Viterbi is robust to the choice of 
probabilities: the numerical values of the field probabil- 
ities are not as important as the probability of one field 
relative to another. 


5 Evaluation 


We evaluated DECODE by focusing on several key ques- 
tions. 


1. How much data does the block hash filtering tech- 
nique remove from processing? 


2. How effectively does our Viterbi-based inference 
process extract fields and records from the filtered 
data? 


3. How much does our post-processing stage improve 
the Viterbi-based results? 


4. How well does the inference process work on phones 
that were unobserved when the state machines were 
developed? 


Experimental Setup. We made use of a number of 
phones from a variety of manufacturers. The phones 
contained some GUI-accessible address book and call 
log entries, and we entered additional entries using each 
phone’s UI. A combination of deleted and GUI-accessible 
data was used in our tests; however, most phones con- 
tained only data that was deleted and therefore unavail- 
able from the phone’s interface but recoverable using 
DECODE. The phones we obtained were limited to those 
for which we could acquire the physical image from memory 
(i.e., all data stored on the phone in its native form). The 
list of phones is given in Table 2. Our evaluation focuses 
on feature phones, i.e., phones with less capability than 
smart phones. 

As stated in Section 4.1, we performed all development 
of DECODE and its PFSMs using only the Nokia 3200B, 
Motorola v551, LG G4015, and Samsung SGH-T309 
phones. We kept the evaluation set of phones separate 
until ready to evaluate performance. We acquired the 
physical image for all phones using Micro Systemation’s 
commercial tool, .XRY. 

Make     | Model     | Count | MB 
PFSM Development Set 
Nokia    | 3200b     | 4     | 1.4 
Motorola | V551      | 2     | 32.0 
Samsung  | SGH-T309  | 2     | 32.0 
LG       | G4015     | 2     | 48.0 
Evaluation Set 
Motorola | V400      | 2     | 32.0 
Motorola | V300      | 2     | 32.0 
Motorola | V600      | 2     | 32.0 
Motorola | V555      | 2     | 32.0 
Nokia    | 6170      | 2     | 4.9 
Samsung  | SGH-X427M | 2     | 16.0 

Table 2: The phone models used in this study. The table shows 
the number we had of each and the size of local storage. 

We focus on two types of records: address book entries 
and call log entries. We chose these record types 
because of their ubiquity across different phone models 
and their relative importance to investigators during triage. 
We evaluate the performance of DECODE’s inference en- 
gine based on two metrics, recall and precision. Recall is 
the fraction of all phone records that DECODE correctly 
identified: the number of true positives over the sum of 
false negatives and true positives. If recall is high, then 
all useful information on a phone has been found. Preci- 
sion is the fraction of extracted records that are correctly 
parsed: the number of true positives over the sum of false 
positives and true positives. If precision is high then the 
information output by DECODE is generally correct. 
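
The two metrics follow directly from the counts of true positives, false positives, and false negatives. The sketch below restates the definitions in code; the counts in the usage line are made-up numbers for illustration, not results from our evaluation.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Precision = TP / (TP + FP); recall = TP / (TP + FN).

    Returns 0.0 for a metric whose denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical counts: 90 records correctly extracted, 30 spurious records,
# and 10 true records missed.
precision, recall = precision_recall(tp=90, fp=30, fn=10)
```

With these hypothetical counts, precision is 0.75 and recall is 0.9.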

Often these two metrics represent a trade-off, but our 
goal is to keep both high. In law enforcement, the relative 
importance of the two metrics depends on the context. For 
generating leads, recall is more important. For satisfying 
the probable cause standard required by a search warrant 
application, moderate precision is needed. Probable cause 
has been defined as “fair probability”¹ that the search 
warrant is justified, and courts do not use a set quantitative 
value. For evidence meeting the beyond a reasonable 
doubt standard needed for a criminal conviction, very 
high precision is required, though again no quantitative 
value can be cited. 

For each of our tested phones, we used .XRY not only 
to acquire the physical image, but also to obtain ground 
truth results that we used to compare against DECODE’s 
results. It was often the case that DECODE obtained re- 
sults that .XRY did not. And in those cases, we manually 
inspected the result and decided whether they were true 
or false positives (painstakingly using a hex editor). We 
made conservative decisions in this regard, but were able 
to employ a wealth of common sense rules. For exam- 
ple, if a call entry seemed to be valid and recent, but was 
several years from all other entries, we labeled it as a 
false positive. Similarly, an address book entry for “A.M.” 
is most reasonably assumed to be a true positive while 
““!Mb” is most reasonably a false positive; even though 
both have two letters and two symbols, the latter does not 
follow English conventions for punctuation. It would be 
impractical to program all such common sense rules, and 
our manual checking is stronger in that regard. 
Occasionally, DECODE extracts partially correct or noisy records. 
We mark each of these records as wrong, unless the only 
error is a missing area code on the phone number. 

¹ United States v. Sokolow, 490 U.S. 1 (1989) 


5.1 Block Hash Filtering Performance 


The goal of BHF is to reduce the amount of data that 
DECODE must parse, reducing run time, without sacri- 
ficing recall. On average, we find that BHF is able to 
filter out about 69% of the phone’s stored data without 
any measurable effect on inference recall. The BHF al- 
gorithm has only two parameters: the shift size d and the 
block size b. Our results show that the shift size does not 
greatly affect the algorithm’s performance, but it has a 
profound effect on storage requirements. Also, we found 
that performance varies with block size, but not as widely 
as expected. 

For each value of b and d that we tested, we kept the 
corresponding BHF sets in an SQL table. The database 
was able to match sets in tens of seconds, so we do not 
report run time performance results here. As an example, 
on a moderately resourceful desktop, DECODE is able to 
filter a 64 megabyte phone, with b = 1024 and d = 128, 
in under a minute. 

Ideally, we (and investigators) would want our hash 
library to be comprised entirely of new phones. If our 
library contains used phones, there is a negligible chance 
that the same common user data (e.g., an address book 
entry with the same name and number) will appear on 
different phones, align perfectly on block boundaries, and 
be erroneously filtered out. Regardless, it was impractical 
for us to find an untouched, new phone model for every 
phone we tested. If data was filtered out in this fashion 
because of our use of pre-owned phones, it would likely 
have shown up in the recall values in the next section; 
since the recall values are near perfect, we can infer this 
problem did not occur. 


Filtering Performance. First, we examined the effect of 
the block size b on filtering. Fig. 4 shows the overall filter 
percentage of our approach for varying block sizes. In 
these experiments, we set d = b so that there was never 
overlap. The line plots the average for all phones. As ex- 
pected, the smaller block sizes make more effective filters. 
However, a small block size results in more blocks and 
consequently, greater storage requirements. On average in 
our tests, 73% of data is filtered out when b = 256, while 
only slightly less, 69%, is filtered out when b = 1024. 
Figure 4: The average performance of BHF as block size varies 
for all phones listed in Table 2 (logarithmic x-axis). Error bars 
represent one standard deviation. In all cases we set d = b (i.e., 
shift size is equal to block size), but performance does not vary 
with d in general. 


Second, we examined the effect of the shift amount d 
on filtering. In our tests, we fixed b = 1024 and varied 
d ∈ {32, 64, 128, 256, 512, 1024}. There is less 
than a 1% difference in filtering between d = 32 and 
d = 1024 for all phones. (No plot is shown.) Again, the 
effect of d is on storage requirements, which we discuss 
below. 

Third, we isolated what type of data is filtered out 
for each phone using fixed block and shift sizes of b = 
1024 and d = 128; we use these values for all other 
experiments in this paper. Fig. 5 shows the results as 
stacked bars; the top graph shows filtering as a percentage 
of the data acquired from the phone, and the bottom graph 
shows the same results in megabytes. For each of the 
25 phones, the bottom (blue) bar shows the percentage 
of data filtered out because the block was a repeated, 
constant value (such as a run of zeros). The middle (black) 
bar shows the percentage of data that was in common with 
a different instance of the same make and model phone. 
The top red bar shows the percentage of data that can be 
filtered out because it is only found on some phone in the 
library that is a different make or model. The data that 
remains after filtering is shown in the top, white box. 

On average, 69% of data is removed by block hash 
filtering. Generally, the technique works well. On aver- 
age, half of the filtered out data was found on another 
phone of the same model. These percentage values are in 
terms of the complete memory, including blocks that were 
filled with constants (effectively empty). Therefore, as a 
percentage of non-empty data, the percentage of filtered 
out data is higher. These results suggest that it is often 
sufficient to only compare BHF sets of the same model 
phone. However, in some models less than 3% of data 
was found on another instance of the same model. This 
poor result was the case for the Samsung SGH-X427M 
and Motorola V300. Finally, the results shown in 
Fig. 5 (bottom) suggest that the performance of BHF was 
not correlated with the total storage space of the phone. 
Our results in the next section on inference, in which 
DECODE examines only data remaining after filtering, 
demonstrate that filtering does not significantly remove 
important information: recall is 93% or higher in all cases. 

Figure 5: The amount of data remaining after filtering is shown 
as solid white bars, as a percentage (top) and in MB (bottom). 
On average, 69% of data is successfully filtered out. Black 
bars show data filtered out because they match data on another 
instance of the same model. Blue bars show data filtered out 
because it is a single value repeated (e.g., all zeros). Red bars 
show data filtered out because it appears on a different model. 
(b = 1024 bytes, d = 128 bytes) 


Figure 6: Precision and recall for call logs. (Top) Results after 
only Viterbi parsing. (Bottom) Results after post-processing. 
Left bars are development set; right bars are evaluation set. In 
all graphs, black is recall and gray is precision. On average, 
development phones have recall of 98%, and precision of 69% 
that increases to 77% after post processing. On average, 
evaluation phones have recall of 97%, and precision of 72% that 
increases to 80% after post processing. The T309 had no call 
log entries, which explains in part DECODE’s poor performance 
for the X427M. 

Storage. An important advantage of our approach is that 
investigators can share the hash sets of phones, without 
sharing the data found within each phone. This sharing is 
very efficient as the hash sets are small compared to the 
phones. The number of blocks from each phone that must 
be hashed and stored in a library is O((n - b)/d), though 
only unique copies of each block need be stored. Given 
that n ≫ b, the number of blocks is dependent on n and 
d, and the effect of b on storage is insignificant. However, 
since it is required that d ≤ b, the algorithm’s storage 
requirements do depend on b’s value in that sense. As 
an example, for a 64 megabyte phone, when b = 1024 
bytes and d = 128 bytes, the resulting BHF set is 524,281 
hash values. At 20 bytes each, the set is 10 megabytes 
(15% of the phone’s storage). Since we need perhaps only 
one or two examples of any phone model, the cumulative 
space needed to store BHF sets for an enormous number 
of phone models is practical. Since BHF gains nearly 
all benefit from comparing phones of the same model, 
comparison will always be fast. 
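
The storage figures in the example above can be reproduced with a few lines of arithmetic. The sketch assumes that a megabyte is 2^20 bytes and that every block is unique (an upper bound, since only unique blocks need be stored); bhf_library_size is a hypothetical helper name.

```python
from typing import Tuple


def bhf_library_size(n: int, b: int, d: int,
                     digest_bytes: int = 20) -> Tuple[int, int]:
    """Upper bound on (blocks hashed, bytes stored) for one n-byte image.

    Blocks start at offsets 0, d, 2d, ... while a full b-byte block fits;
    digest_bytes is 20 for SHA-1."""
    blocks = (n - b) // d + 1
    return blocks, blocks * digest_bytes


blocks, size = bhf_library_size(n=64 * 2**20, b=1024, d=128)
print(blocks)                                # 524281
print(round(size / 2**20, 2))                # 10.0 (megabytes)
print(round(100 * size / (64 * 2**20), 1))   # 15.6 (% of phone storage)
```

The block count scales with n/d, which is why the shift size rather than the block size dominates storage.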

In order to be effective, the library needs to be con- 
structed using the same hash function and block size for 
all phones; however, the shift amount need not be the 
same. This is important because the storage requirement 
of the library is inversely proportional to the shift size and 
thus is minimized when d = b. Conversely, BHF removes 
the most data when d = 1. We can effectively achieve 
maximal filtering with minimal storage using d = b for 
the library and d = 1 for the test phone. The cost of this 
approach is more computation and consequently higher 
run times. A full analysis is beyond the scope of this 
paper. 
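
The asymmetric choice of shift sizes can be illustrated with a toy example. The library is built with d = b (non-overlapping blocks, minimal storage), while the test phone is scanned with d = 1 so that shared data is still found when it sits at an offset that is not a multiple of b. This is a sketch under our own assumptions (in-memory set as the library, hypothetical function names, no rolling hash), not the DECODE implementation.

```python
import hashlib
from typing import Set


def library_hashes(data: bytes, b: int) -> Set[bytes]:
    """Library side: hash non-overlapping blocks (shift d = b)."""
    return {hashlib.sha1(data[i:i + b]).digest()
            for i in range(0, len(data) - b + 1, b)}


def filter_with_shift(data: bytes, library: Set[bytes],
                      b: int, d: int) -> bytes:
    """Test-phone side: drop bytes covered by any matching block."""
    covered = bytearray(len(data))
    for i in range(0, len(data) - b + 1, d):
        if hashlib.sha1(data[i:i + b]).digest() in library:
            covered[i:i + b] = bytes([1]) * b
    return bytes(x for x, c in zip(data, covered) if not c)


shared = bytes(range(32))              # data common to both phones
library = library_hashes(shared, b=8)  # 4 hashes stored
phone = b"xyz" + shared + b"tail"      # shared data starts at offset 3

aligned_only = filter_with_shift(phone, library, b=8, d=8)  # misses it
sliding = filter_with_shift(phone, library, b=8, d=1)       # finds it
```

The d = 1 scan hashes one block per byte position of the test phone, which is the additional computation referred to above.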


5.2 Inference Performance 


To evaluate our inference process, we used DECODE to 
recover call log and address book entries from a variety 
of phones. In our results, we distinguish between the 
performance of the Viterbi and decision tree portions of 
inference. Additionally, we make clear the performance 
of DECODE on phones in our development set versus 
phones in our evaluation set. All results in this section 
assume that input is first processed using BHF. 

Fig. 6 shows the performance of our inference process 
for call logs; the top results are before the post-processing 
step and the bottom after post-processing. The white- 
space break in the chart separates the development set 
of phones (on the left), and the evaluation set (on the 
right). We put the most effort toward encoding high qual- 
ity PFSMs for the Nokia and Motorola phones. Not sur- 
prisingly, the results are best in general for these makes, 
indicating that the performance of DECODE is dependent 
on the quality of the PFSMs. However, the results also 
show that DECODE can perform well even for the previ- 
ously unseen phones in the evaluation set. Overall, recall 
of DECODE is near complete at 98% for development 
phones and 99% for evaluation phones. Precision is more 
challenging, and after Viterbi is at 69% for development 
phones and 72% for evaluation phones. It is important 
to note that no extra work on DECODE was performed 
to obtain results from the phones in the evaluation set, 
which is significant compared to methods that instrument 
executables or perform other machine and platform de- 
pendent analysis. After post-processing, the precision for 
the development and evaluation phones increased to 77% 
and 80% respectively. 

Fig. 7 shows the performance of our inference process 
for address book records. As before, the top results are 
after filtering but not post-processed while the bottom 
are post-processed. Overall, recall of DECODE is 
again high at 99% for development phones and 93% for 
evaluation phones. Precision after Viterbi is 56% for de- 
velopment phones and 36% for evaluation phones. After 
post processing by the decision tree, the precision for all 
phones increased, by an average of 61% over the Viterbi- 
only results, a significant improvement. For development 
phones, precision increases to 65% on average. (Note that 
the development phones are used to train the classifier.) 
For evaluation phones, precision increases significantly 
to 52%. 

While performance is not perfect, we could likely im- 
prove performance by using a different set of PFSMs 
for each different phone manufacturer. In our evaluation, 
all PFSMs for all manufacturers are evaluated at once. 
Because our goal is to allow for phone triage, we don’t 
reduce the set of state machines for each manufacturer; 
however, a set of manufacturer-specific state machines 
could improve performance at the expense of being a less 
general solution. 

We also note that when recall is high, it is easier to 
discover the intersection of information found on two 
independent phones from the same criminal context; that 
intersection is likely to be a better lead than most. 

When necessary, we can prioritize precision over recall. 
Fig. 8 shows the results of culling records where the 

Figure 7: Precision and recall for address book entries. (Top) 
Results after only Viterbi parsing. (Bottom) Results after 
post-processing. On average, development phones have recall of 
99%, and precision of 56% that increases to 65% after 
post-processing. On average, evaluation phones have recall of 93%, 
and precision of 36% that increases to 52% after post-processing. 
N.b., the first Nokia has no address book entries at all. 

Figure 8: Precision and recall for address book entries after 
culling results that do not match phone numbers in DECODE’s 
call logs for the same phone. For some phones, all results are 
culled. On average, development phones have recall of 16%, 
and precision of 92% (when results are present). On average, 
evaluation phones have recall of 14%, and precision of 94% 
(when results are present). 


phone number in the address book does not also appear 
in the call log: precision is increased to 92%, although 
recall drops to 14%. (We don’t show the same process for 
call logs.) This simple step shows how easy it is to isolate 
results for investigators that deem precision of results 
more important than recall. Moreover, the results that are 
culled are still available for inspection. 
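
The culling step can be sketched in a few lines of Python. This is a minimal illustration, not DECODE's implementation: the record layout (dictionaries with a "number" field), the normalize helper, and the sample records are all assumptions made for the example.

```python
def normalize(number: str) -> str:
    """Keep digits only so '555-0100' and '(555) 0100' compare equal."""
    return "".join(ch for ch in number if ch.isdigit())


def cull_address_book(address_book, call_log):
    """Split address-book records into (kept, culled).

    A record is kept only if its phone number also occurs in the call log.
    Culled records are returned as well, so they stay available for inspection.
    """
    seen = {normalize(entry["number"]) for entry in call_log}
    kept, culled = [], []
    for record in address_book:
        target = kept if normalize(record["number"]) in seen else culled
        target.append(record)
    return kept, culled


# Hypothetical records, not taken from the evaluation data.
address_book = [
    {"name": "Alice", "number": "555-0100"},
    {"name": "x?q", "number": "000-0000"},  # plausible false positive
]
call_log = [{"number": "(555) 0100", "type": "incoming"}]

kept, culled = cull_address_book(address_book, call_log)
```

Returning both lists mirrors the point made above: precision is raised for the kept set while the culled records are not discarded.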


Execution time. Inference is the slowest component of 
DECODE. The post processing step takes a few seconds, 
but the Viterbi component takes significantly longer. On 
average, DECODE’s Viterbi processes 12,781 bytes/sec. 
The smaller phones in our set (Nokias) finish in a few minutes, 
while the larger Motorola completes in about 15 
minutes. Since the Viterbi processing already works with 
distinct blocks of input produced by the BHF component, 
it would be straightforward to produce a parallel version 
of Viterbi for our scenario, thereby greatly increasing 
speed. 


5.3 Limitations 


Our evaluation is limited in a number of ways in addition 
to what was previously discussed. First, as with any 
empirical study, our results are dependent on our test 
cases. While our set of phones is limited, it contains 
phones from a variety of makes and models. In future 
work, we aim to test against additional phones. Second, 
our tests are performed only on call logs and address 
book entries. Presently, we are extending DECODE to 
examine other artifacts, including stored text messages. 
Since many phone artifacts are similar in nature — text 
messages are stored as strings, phone numbers, and dates 
— extending DECODE to parse additional types is easier 
than creating the initial PFSMs. 

Our approach also has a number of limitations. First, 
we don’t address the challenge of acquiring the physical 
memory image from phones, which is an input needed for 
DECODE. Here, we have leveraged existing tools to do 
so. However, acquisition is an independent endeavor and 
varies considerably with the platform of the phone. Part 
of our goal is to show that despite hardware (and software) 
differences, one approach is feasible across a spectrum of 
devices. Second, DECODE’s performance is tied strongly 
to the quality of the PFSMs. Poorly designed state ma- 
chines, especially those with few states, can match any 
input. We do not offer an evaluation of whether it is hard 
or time consuming to design high quality PFSMs or other 
software engineering aspects of our problem; we report 
only our success. Third, a single PFSM has an inherent 
endianness embedded in it. DECODE does not automati- 
cally reorganize state machines to account for data that is 
the opposite endianness. Fourth, we have not explicitly 
demonstrated that phones do indeed change significantly 
from model to model or among manufacturers. This assertion 
is suggested by DECODE’s varied performance 
across models but we offer no overall statistics. 

It is also important to note that DECODE is an inves- 
tigative tool and not necessarily an evidence-gathering 
tool. Tools for gathering evidence must follow a specific 
set of legal guidelines to ensure the admissibility of the 
collected evidence in court. For example, the tool or tech- 
nique must have a known error rate (for example, see 
Daubert v. Merrell Dow Pharmaceuticals, 509 U.S. 579 
(1993)). 

Finally, our approach is to gather artifacts that match a 
description that may be too vague in some contexts. For 
example, DECODE ignores important metadata that is 
encoded in bit flags that may indicate whether an entry is deleted. 
Such metadata can be critical in investigations. It is our 
aim to have DECODE parse more metadata in the future. 


6 Related Work 


Our work is related to a number of works in both reverse 
engineering and forensics. We did not compare DECODE 
against these works as each has a significant limitation 
or assumption that does not apply well to the criminal 
investigation of phones. 

Polyglot [2], Tupni [6], and Dispatcher [1] are 
instrumentation-based approaches to reverse engineer- 
ing. Since binary instrumentation is a complex, time- 
consuming process, it is poorly suited to mobile phone 
triage. Moreover, our goal is different from that of Poly- 
glot, Tupni, and Dispatcher. We seek to extract informa- 
tion from the data rather than reverse engineer the full 
specification of the device’s format. 

Other previous works have attempted to parse machine 
data without examining executables. Discoverer [5] at- 
tempts to derive the format of network messages given 
samples of data. However, Discoverer is limited to identi- 
fying exactly two types of data — “text” and “binary” — 
and extending it to additional types is a challenge. Overall, 
it does not capture the rich variety of types that DECODE 
can distinguish. 

LearnPADS [7,8,25] is another sample-based system. 
It is designed to automatically infer the format of ad hoc 
data, creating a specification of that format in a custom 
data description language (called PADS). Since Learn- 
PADS relies on explicit delimiters, it is not applicable to 
mobile phones. 

Cozzie et al. [4] use Bayesian unsupervised learning to 
locate data structures in memory, forming the basis of a 
virus checker and botnet detector. Unlike DECODE, their 
approach is not designed to parse the data but rather to 
determine if there is a match between two instances of a 
complex data structure in memory. 

In our preliminary work [23], we used the Cocke- 
Younger-Kasami (CYK) algorithm [10] to parse the 
records of Nokia phones. While this effort influenced 
the development of DECODE, it was much more limited 
in scope and function. 

The idea of extracting records from a physical memory 
image is similar to file carving. File carving is focused 
on identifying large chunks of data that follow a known 
format, e.g., jpegs or mp3s. Some file carving techniques 
match known file headers to file footers [18,20] when they 
appear contiguously in the file system. More advanced 
techniques can match pieces of images fragmented in the 
file system relying on domain specific knowledge about 
the file format [19]. In contrast, our goal is to identify and 
parse small sequences of bytes into records — all without 
any knowledge of the file system. Moreover, we seek to 
identify information within unknown formats that only 
loosely resemble the formats we’ve previously seen. 

DECODE’s filtering component is similar to a number 
of previous works. Block hashes have been used by 
Garfinkel [9] to find content that is of interest on a large 
drive by statistically sampling the drive and comparing 
it to a Bloom filter of known documents. This recent 
work has much in common with both the rsync algo- 
rithm [22], which detects differences between two data 
stores using block signatures, as well as the Karp-Rabin 
signature-based string search algorithm [13], among oth- 
ers. 


7 Conclusions 


We have addressed the problem of recovering informa- 
tion from phones with unknown storage formats using 
a combination of techniques. At the core of our system 
DECODE, we leverage a set of probabilistic finite state 
machines that encode a flexible description of typical data 
structures. Using a classic dynamic programming algo- 
rithm, we are able to infer call logs and address book 
entries. We make use of a number of techniques to make 
this approach efficient, processing data in about 15 min- 
utes for a 64-megabyte image that has been acquired from 
a phone. First, we filter data that is unlikely to contain 
useful information by comparing block hash sets among 
phones of the same model. Second, our implementation 
of Viterbi and the state machines we encoded are effi- 
ciently sparse, collapsing a great deal of information in a 
few states and transitions. Third, we are able to improve 
upon Viterbi’s result with a simple decision tree. 

Our evaluation was performed across a variety of phone 
models from a variety of manufacturers. Overall, we are 
able to obtain high performance for previously unseen 
phones: an average recall of 97% and precision of 80% 
for call logs; and average recall of 93% and precision 
of 52% for address books. Moreover, at the expense of 
recall dropping to 14%, we can increase precision to 94% 
by culling results that don’t match between call logs and 
address book entries on the same phone. 


mCarve: Carving Attributed Dump Sets 


Ton van Deursen* 
University of Luxembourg 


Abstract 


Carving is a common technique in digital forensics to 
recover data from a memory dump of a device. In con- 
trast to existing approaches, we investigate the carving 
problem for sets of memory dumps. Such a set can, for 
instance, be obtained by dumping the memory of a num- 
ber of smart cards or by regularly dumping the memory 
of a single smart card during its lifetime. The problem 
that we define and investigate is to determine at which 
location in the dumps certain attributes are stored. By 
studying the commonalities and dissimilarities of these 
dumps, one can significantly reduce the collection of 
possible locations for such attributes. We develop algorithms 
that support this process, implement them in a 
prototype, and apply this prototype to reverse engineer 
the data structure of a public transportation card. 


1 Introduction 


In digital forensics, the process of recovering data from 
a memory dump of a device is called carving. The main 
objective of current file carving approaches is to recon- 
struct (partially) deleted, damaged or fragmented files. A 
typical example is the analysis of memory dumps from 
cell phones [1]. Because a file can be permuted in many 
possible ways, the process of reassembling files is very 
labor intensive. Therefore, fully and semi-automatic file 
carving tools have been developed that aid the human in- 
spection process. 

Traditional carving approaches aim to analyze a single 
memory dump. In some cases, however, one may have 
access to a series of similarly structured dumps. This 
may result from observing a system that progresses in 
time, while making memory dumps at regular time in- 
tervals, or from dumping the memory of a collection of 
similar systems. An example is the analysis of the data 
encoded on a public transportation card. It is possible to 
collect dumps of several cards after each usage. This will 
be the running example throughout this paper. 

We will investigate the problem of carving sets of 
dumps under two simplifying assumptions. The first as- 
sumption is that we can observe certain relevant proper- 
ties of the system at the moment of dumping its memory. 
In this way, we can collect the values of a number of at- 
tributes that characterize part of the state of the system, 
and link that information to the memory dump. An ex- 
ample of such an attribute is the number of rides left on a 
public transportation card, which can be easily observed 
from the display of the card reader when validating the 
card. The carving problem for such attributed dump sets 
is then described as the problem of finding at which lo- 
cation in the memory dump the attributes are stored. 

The second assumption is that the memory layout is 
either static or semi-dynamic. A memory layout is static 
if the attributes are stored at the same location in every 
dump and the dumps have the same length. An attribute 
is stored semi-dynamically if it is stored alternatingly in 
a number of different locations. This will allow us to 
develop algorithms to identify such possible locations in 
dumps. 

Carving dump sets allows one to reverse engineer the 
memory layout of a system and understand or even ma- 
nipulate the system’s functioning. Several applications 
can be thought of. A first example is the analysis of the 
data collected in systems using smartcards, such as the 
transportation card mentioned above. One can e.g. ver- 
ify privacy concerns by inspecting which travel informa- 
tion is stored on the card. Another example is the analy- 
sis of the data structures of an obfuscated piece of soft- 
ware (e.g. malware) or of a piece of software of which 
the specifications have been lost (e.g. legacy code). 

The problem of carving attributed dump sets is differ- 
ent from the traditional file carving problem. While tra- 
ditional file carving tools can be used to obtain informa- 
tion about each dump in a set, the dump set’s evolution 
and known attributes provide additional information not 
available in traditional file carving. 


Our paper is concerned with the problem of extracting 
this additional information. The main contributions of 
this paper are: (1) to define the problem of carving dump 
sets (Section 3); (2) to develop and analyze a method- 
ology for carving dump sets based on two simple oper- 
ations (Sections 3 and 5); (3) to develop a prototype 
carving tool, called mCarve (Section 7); and (4) to apply 
this tool to reverse engineer the data structure of the e-go 
system (Section 8). 


2 Related work 


Closest to our work are file carving approaches that try 
to recover files from raw data. These approaches try to 
recover the data of a single dump whereas we focus on 
recovering data (and data structures) of a set of dumps. 
Garfinkel [7] describes several carving algorithms that 
recover files by searching for headers of known file for- 
mats. These algorithms reconstruct files based on their 
raw data, rather than using the metadata that points to 
the content. Cohen [2] formalizes file carving as a con- 
struction of a mapping function between raw data bytes 
and image bytes. Based on this formalization, he de- 
rives a carving algorithm and applies it to PDF and ZIP 
file carving. In recent work, Sencar and Memon [10] 
describe an approach to identify and recover JPEG files 
with missing fragments. Common to these file carving 
approaches is that they are designed for one (or a small 
set of) known file format(s). 


More general, but perhaps less powerful are the ap- 
proaches that analyze binary data by visual inspection. 
Conti et al. [3] describe a tool that allows analysts to vi- 
sually reverse engineer binary data and files. Their tool 
supports simple techniques such as displaying bytes as 
pixels, but also more complicated techniques that visu- 
alize self-similarity in binary data. Helfman [8] first vi- 
sualized self-similarity in binary data using dotplot pat- 
terns. Using dotplot patterns he revealed redundancy in 
various encodings of information. 


Some information in a memory dump may be con- 
structed using CRCs, cryptographic hashes, or encryp- 
tion. Since the entropy of these pieces of data is higher 
than of structured data, they can be detected using en- 
tropy analysis. Several methods to efficiently find cryp- 
tographic keys are described in [11]. Some of these tech- 
niques are based on trial-and-error, while others identify 
possible keys by measuring entropy. Testing whether a 
given string is random has been studied extensively. See 
e.g. [9] for an overview and implementation of the most 
important algorithms. 


3 Carving attributed dump sets 


The concept that is central to our research is the concept 
of a dump. A dump consists of raw binary data that is 
captured from a system, for instance, from a computer’s 
memory, a data carrier or a communication transcript. 
An example of a dump is the contents of a public trans- 
portation card’s memory. 

We assume that the process of creating a dump can be 
repeated, allowing us access to a number of dumps of the 
same system. We call such a collection of dumps a dump 
set. One can, e.g., consider dumps of a number of public 
transportation cards, both before and after their use. We 
assume that different dumps of the same system have the 
same length. If we denote the bit strings of length $n \in \mathbb{N}$ 
by $\mathbb{B}^n$ and bit strings of arbitrary, finite length by $\mathbb{B}^*$, 
then a set of dumps of length $n$ is denoted by $S \subseteq \mathbb{B}^n$. 
The length $n$ of bit string $s \in \mathbb{B}^n$ is denoted by $|s|$ and 
the number of elements in set $S$ is denoted by $|S|$. In 
this paper, the closed interval $[i, j]$ will denote the set of 
integers $z$ such that $i \le z \le j$ and the half-open interval 
$[i, j)$ will denote the set of integers $z$ such that $i \le z < j$. 
For $i \in [0, |s|)$ we denote the $i$-th bit of $s$ by $s_i$. For 
$I \subseteq [0, |s|)$, we denote the subsequence of $s$ that consists 
of all elements with index in $I$ by $s|_I$. The subsequence 
operator extends to sets of dumps in the obvious way. 

A dump contains information about the state of the 
system, e.g., the number of rides left on a public trans- 
portation card or the last time that it was used. We call 
such state properties attributes. For each dump set we 
consider a set $A$ of attributes. The function $\mathit{type}\colon A \to \mathcal{D}$ 
assigns to every attribute a finite value domain, where $\mathcal{D}$ 
denotes the set of all finite value domains. The value 
of attribute $a \in A$ expressed in dump $s$ is denoted by 
$\mathit{val}_a\colon S \to \mathit{type}(a)$. For instance, the type of the 
attribute rides-left can be $[0, 15]$ and a particular dump $s$ of 
a card can have 5 rides left, so $\mathit{val}_{\textit{rides-left}}(s) = 5$. The 
type of the attribute last-used is the set of all dates 
between 1/1/2000 and 1/1/2050, extended with the time of 
day in hh:mm:ss format. 

A dump contains the system’s attribute values in a bi- 
nary representation. The mapping from an attribute do- 
main to its binary representation is called an encoding. 
We assume that for a given attribute $a \in A$ the length of 
an encoding is fixed, so an encoding of $a$ is a function 
from $\mathit{type}(a)$ to $\mathbb{B}^n$ for some $n \in \mathbb{N}$. This function is 
required to be injective. For the public transportation card, 
a sample encoding of the rides-left attribute is the (5-bit) 
binary representation and a possible encoding of the last-used 
attribute is the number of seconds since 1/1/2000, 
00:00 hrs modulo $2^n$ expressed in binary format. The 
set of all encodings of $D \in \mathcal{D}$ is denoted by $\mathcal{E}_D$. 
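
The two sample encodings can be written out as a short Python sketch. The function names and the 32-bit default width for last-used are assumptions for illustration; the text only fixes the 5-bit width for rides-left and leaves $n$ open for last-used.

```python
from datetime import datetime

EPOCH = datetime(2000, 1, 1)  # 1/1/2000, 00:00 hrs


def encode_rides_left(rides: int, width: int = 5) -> str:
    """Fixed-width binary encoding of rides-left; injective on [0, 15]."""
    if not 0 <= rides <= 15:
        raise ValueError("rides-left must lie in [0, 15]")
    return format(rides, f"0{width}b")


def encode_last_used(moment: datetime, width: int = 32) -> str:
    """Seconds since the epoch, modulo 2**width, as a bit string."""
    seconds = int((moment - EPOCH).total_seconds())
    return format(seconds % (2 ** width), f"0{width}b")


five_rides = encode_rides_left(5)
one_minute = encode_last_used(datetime(2000, 1, 1, 0, 1, 0), width=8)
```

Note that the modulo keeps the encoding injective only while the whole value domain fits into the chosen width; with 8 bits, as in the toy call above, it would wrap after 256 seconds.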

We start with the assumption that an attribute is always 
stored at the same location in all dumps of the system. In 
Section 5 we will extend this to semi-dynamic attributes. 
With this assumption we can identify which bits of the 
dump are related to a given attribute. This is captured in 
the notion of an attribute mapping. Here we denote the 
powerset of a set X by P(X). 











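Definition 1. Let $S \subseteq \mathbb{B}^n$ be a dump set with dumps 
of length $n$. An attribute mapping for $S$ is a function 
$f\colon A \to \mathcal{P}([0, n))$, such that 

$$\forall a \in A\ \exists e \in \mathcal{E}_{\mathit{type}(a)}\ \forall s \in S\colon\ s|_{f(a)} = e(\mathit{val}_a(s)).$$

An attribute mapping is non-overlapping if 

$$\forall a_1, a_2 \in A\colon\ a_1 \neq a_2 \implies f(a_1) \cap f(a_2) = \emptyset.$$

An attribute mapping is contiguous if 

$$\forall a \in A\ \exists i, j \le n\colon\ f(a) = [i, j).$$
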
Given a dump set S and all attribute values for each 
dump in S, the carving problem for attributed dump sets 
is the problem of finding an attribute mapping for S. 

The existence of such a mapping does not imply that 
the attributes are indeed encoded in the dump, but merely 
that they could have been encoded at the indicated po- 
sitions in the dumps. Conversely, if an attribute can- 
not be mapped in S, it means that this attribute is not 
present through a deterministic, injective encoding. Of 
course, this does not rule out the possibility that a non- 
deterministic encoding is used, such as a probabilistic 
encryption, or that the attribute is stored dynamically, 
i.e. not always at the same location. We consider the 
search for high-entropy information and semi-dynamic 
attributes later in this paper. 

The notion of an attribute mapping is illustrated in Figure 1. 
This example consists of five dumps, $s_1, \ldots, s_5$, of 
length $n = 18$. We look at the attribute rides-left (rl) 
with the values as given in the figure and we consider 
two possible encodings $enc_1$ and $enc_2$. The first encoding 
is the standard binary encoding of natural numbers. 
It can be found in the dumps at two different (contigu- 
ous) positions: [5, 8] and [12, 15]. The second encoding, 
which is not standard, occurs at positions [3, 6]. Each of 
these three cases defines a contiguous attribute mapping 
for rides-left. There might be more candidate encodings. 
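
The claims about Figure 1 can be checked mechanically. The sketch below uses the four dumps $s_2, \ldots, s_5$ whose bit strings are listed in the figure; the helper names are illustrative assumptions, and intervals are given with inclusive bounds, matching the positions quoted in the text.

```python
# Dumps s2..s5 and their rides-left values as listed in Figure 1.
DUMPS = {
    "s2": ("001100100001010010", 4),
    "s3": ("101110101011010100", 5),
    "s4": ("001010110111011011", 6),
    "s5": ("111010110011011001", 6),
}


def admits_encoding(dumps, lo, hi):
    """True if bits lo..hi (inclusive) are determined by the attribute value
    and distinct values give distinct patterns (some injective encoding fits)."""
    by_value = {}
    for bits, value in dumps.values():
        window = bits[lo:hi + 1]
        if by_value.setdefault(value, window) != window:
            return False  # same value, different bits: not deterministic
    return len(set(by_value.values())) == len(by_value)


def matches_binary(dumps, lo, hi):
    """True if bits lo..hi hold the standard binary encoding of the value."""
    width = hi - lo + 1
    return all(
        bits[lo:hi + 1] == format(value, f"0{width}b")
        for bits, value in dumps.values()
    )


standard_hits = [iv for iv in [(5, 8), (12, 15), (3, 6)] if matches_binary(DUMPS, *iv)]
```

On these dumps the standard binary encoding matches at [5, 8] and [12, 15] but not at [3, 6]; the latter still admits an injective, non-standard encoding.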


4 Commonalities and dissimilarities 


Given the values of an attribute for the dumps in a dump 
set S, we can use the commonalities and dissimilarities 
of these dumps to derive restrictions on the possible at- 
tribute mappings for S. Such restrictions are derived in 
two steps. In the first step we look at dumps that have 
the same attribute value. In this case, we can derive those 

dump  rl  bits                enc_1  enc_2 
s_2   4   001100100001010010  0100   1001 
s_3   5   101110101011010100  0101   1101 
s_4   6   001010110111011011  0110 
s_5   6   111010110011011001  0110 

Figure 1: Example of a dump set with three possible 
attribute mappings. 


positions in the bit strings that cannot occur in the encod- 
ing of the attribute. In the second step we look at dumps 
of which the attribute values differ, allowing us to deter- 
mine positions in the bit strings that should occur in the 
encoding of the attribute. 

For the first step, we start by observing that an attribute 
$a \in A$ induces a partition 

$$\mathit{bundles}(a, S) = \{\{s \in S \mid \mathit{val}_a(s) = d\} \mid d \in \mathit{type}(a)\}$$

on a dump set $S$. An element of this partition is called a 
bundle. Thus, a bundle is a set of dumps with the same 
attribute value. For instance, Figure 1 shows three bundles 
for attribute rides-left (rl), namely $\{s_1, s_2\}$, $\{s_3\}$, 
and $\{s_4, s_5\}$. 
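
Computing the bundles amounts to grouping dumps by attribute value. A minimal sketch, assuming dumps are identified by name and that only values which actually occur produce a bundle; the value 4 for $s_1$ is inferred from its sharing a bundle with $s_2$:

```python
from collections import defaultdict


def bundles(values):
    """Partition dump names by attribute value.

    `values` maps a dump name to the attribute value observed for that dump.
    """
    groups = defaultdict(set)
    for dump, value in values.items():
        groups[value].add(dump)
    return list(groups.values())


# rides-left values consistent with the three bundles named in the text.
rides_left = {"s1": 4, "s2": 4, "s3": 5, "s4": 6, "s5": 6}
partition = bundles(rides_left)
```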

The common set determines which bits in the dumps 
of a dump set are equal if the attribute values are equal. 











Definition 2. Let $a \in A$ be an attribute and $S \subseteq \mathbb{B}^n$ 
be a dump set. The common set of $S$ with respect to $a$, 
denoted by $\mathit{comm}(a, S) \subseteq [0, n)$, is defined by 

$$\mathit{comm}(a, S) = \bigcap_{b \in \mathit{bundles}(a, S)} \{ i \in [0, n) \mid \forall s, s' \in b\colon s_i = s'_i \}.$$


An example is given in Figure 3. The elements from 
the common set are marked with an asterisk. 

Given that the encoding of an attribute value is deter- 
ministic, this gives an upper bound on the bits used for 
this attribute. 
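
A direct way to compute the common set is to compare every dump with the first dump seen for the same attribute value. The sketch below is an illustration under that assumption (it is not the algorithm of Section 6) and reuses the four dumps listed in Figure 1.

```python
def common_set(dumps):
    """Indexes at which all dumps sharing an attribute value agree.

    `dumps` is a list of (bit_string, attribute_value) pairs of equal length.
    """
    n = len(dumps[0][0])
    first_seen = {}            # attribute value -> first dump with that value
    common = set(range(n))
    for bits, value in dumps:
        reference = first_seen.setdefault(value, bits)
        # Agreeing with one reference per bundle implies pairwise agreement.
        common -= {i for i in range(n) if bits[i] != reference[i]}
    return common


# Dumps s2..s5 with their rides-left values, as listed in Figure 1.
DUMPS = [
    ("001100100001010010", 4),
    ("101110101011010100", 5),
    ("001010110111011011", 6),
    ("111010110011011001", 6),
]
comm = common_set(DUMPS)
```

Only the bundle with two dumps constrains the result here; the three interval positions discussed for Figure 1 all lie inside the computed set, as Lemma 1 requires.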


Lemma 1. Let $A$ be an attribute set and let $f$ be an attribute 
mapping for dump set $S \subseteq \mathbb{B}^n$, then 

1. $\forall a \in A\colon f(a) \subseteq \mathit{comm}(a, S)$, 

2. if $I_a \subseteq [0, n)$ is a family of sets for $a \in A$, such 
that $f(a) \subseteq I_a \subseteq \mathit{comm}(a, S)$, then the function 
$f'\colon A \to \mathcal{P}([0, n))$, defined by $f'(a) \mapsto I_a$, is an 
attribute mapping. 

The first property states that every possible attribute 
mapping is enclosed in the common set, so one can re- 
strict the search for attribute mappings to the locations in 
the common set. The second property expresses that ev- 
ery extension of an attribute mapping is also an attribute 
mapping, provided that it does not extend beyond the 
common set. 

Next we look at dumps with different attribute values. 
Injectivity of the encoding function implies that the en- 
coding of two different values must differ at least in one 
bit. This is captured in the notion of a dissimilarity set. 
This set consists of all intervals that, for each pair of 
dumps with a different attribute value, contain at least 
one location where the two dumps differ. 














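Definition 3. Let $a \in A$ be an attribute and $S \subseteq \mathbb{B}^n$ be 
a dump set. The dissimilarity set of $S$ with respect to $a$, 
denoted by $\mathit{diss}(a, S) \subseteq \mathcal{P}([0, n))$, is defined by 

$$\mathit{diss}(a, S) = \{ I \subseteq [0, n) \mid \forall s, s' \in S\colon (\mathit{val}_a(s) \neq \mathit{val}_a(s') \implies \exists i \in I\colon s_i \neq s'_i) \}$$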


An example of the dissimilarity set is given in Fig- 
ure 4. The next lemma expresses that every attribute 
mapping is an element of the dissimilarity set. Conse- 
quently, we can restrict the search for possible attribute 
mappings to the elements of the dissimilarity set. 





Lemma 2. Let $A$ be an attribute set and let $f$ be an 
attribute mapping for dump set $S \subseteq \mathbb{B}^n$, then 
$\forall a \in A\colon f(a) \in \mathit{diss}(a, S)$. 


An encoding of an attribute value a must at least con- 
tain the indexes from one of the sets in diss(a, S). This 
implies that we are mainly interested in the smallest sets 
in diss(a, S), i.e. those sets of which no proper subset is 
in diss(a, S). In order to make this precise, we introduce 
some notation. 

Let $F$ be a set and let $P \subseteq \mathcal{P}(F)$. We define the 
superset closure of $P$, notation $\overline{P}$, by 
$\overline{P} = \{p \subseteq F \mid \exists p' \in P\colon p' \subseteq p\}$. 
A set $P$ is superset closed if $\overline{P} = P$. 
We observe from its definition that $\mathit{diss}(a, S)$ is superset 
closed. 

Given $P \subseteq \mathcal{P}(F)$, we say that $P$ is subset-minimal 
if for every $p, p' \in P$, $p' \subseteq p \implies p' = p$. Thus, a 
collection of sets is subset-minimal if no set is a strict 
subset of any other set in the collection. 





Lemma 3. Let $F$ be a finite set and let $P \subseteq \mathcal{P}(F)$. 
Then there exists a unique subset-minimal set $Q$ such that 
$\overline{Q} = \overline{P}$. 


Given P as in Lemma 3, we denote the unique subset- 
minimal set by smin(P). Then, in order to determine 
whether an encoding of an attribute contains at least the 
indexes from one of the sets in diss(a, S), it suffices 
to verify that it at least contains one of the sets from 
smin(diss(a, S)). 
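
For small collections, the subset-minimal set can be computed by discarding every member that strictly contains another member. The following is a quadratic sketch for illustration; the function names and the sample collection are assumptions, not part of the paper.

```python
def smin(collection):
    """Subset-minimal members: drop every set that strictly contains another."""
    sets = {frozenset(p) for p in collection}
    return {p for p in sets if not any(q < p for q in sets)}


def contains_minimal(positions, minimal_sets):
    """True if `positions` includes at least one of the minimal sets."""
    positions = frozenset(positions)
    return any(m <= positions for m in minimal_sets)


# Hypothetical collection of index sets.
minimal = smin([{0}, {0, 1}, {1, 2}, {0, 1, 2}])
```

Checking a candidate encoding position against the result then reduces to one subset test per minimal set, as in contains_minimal.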

By combining the results of the previous lemmas, we 
get the following main result. 


Theorem 1. Let $A$ be an attribute set and let $f$ be an 
attribute mapping for dump set $S \subseteq \mathbb{B}^n$, then 

$$\forall a \in A\ \exists I \in \mathit{smin}(\mathit{diss}(a, S))\colon\ I \subseteq f(a) \subseteq \mathit{comm}(a, S).$$


This theorem says that if an attribute is expressed in 
a dump set, then its encoding position should contain at 
least one of the minimal dissimilarity sets and may not 
go beyond the common set. 

A consequence of the theorem is that by calculating 
$\mathit{diss}(a, S)$ and $\mathit{comm}(a, S)$, we can limit the search 
space when looking for the attribute mapping f(a) in the 
dumps. We will now investigate how to further limit the 
search space. 
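
To see how much the two sets narrow the search, one can enumerate the contiguous windows that stay inside the common set and still separate all dumps with different attribute values. The brute-force sketch below is an illustration on the four dumps listed in Figure 1; it is an assumption-laden stand-in, not the algorithm developed later in the paper.

```python
from itertools import combinations

# Dumps s2..s5 with their rides-left values, as listed in Figure 1.
DUMPS = [
    ("001100100001010010", 4),
    ("101110101011010100", 5),
    ("001010110111011011", 6),
    ("111010110011011001", 6),
]


def candidate_intervals(dumps, width):
    """Contiguous windows (lo, hi), inclusive, of the given width that lie in
    the common set and separate every pair of dumps with different values."""
    n = len(dumps[0][0])
    pairs = list(combinations(dumps, 2))
    # Common set: indexes on which equal-valued dumps always agree.
    common = {
        i for i in range(n)
        if all(b1[i] == b2[i] for (b1, v1), (b2, v2) in pairs if v1 == v2)
    }
    found = []
    for lo in range(n - width + 1):
        window = range(lo, lo + width)
        inside_common = all(i in common for i in window)
        separates = all(
            any(b1[i] != b2[i] for i in window)
            for (b1, v1), (b2, v2) in pairs if v1 != v2
        )
        if inside_common and separates:
            found.append((lo, lo + width - 1))
    return found


candidates = candidate_intervals(DUMPS, 4)
```

The three positions named for Figure 1, [3, 6], [5, 8] and [12, 15], are among the surviving candidates, while windows that touch an index outside the common set are rejected.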

Let $\mathit{filter}(A, c) = \{a \in A \mid a \subseteq c\}$ denote the filtration 
of a collection of sets in $A$ with respect to a set $c$. It 
is easy to see that the sets of interest for an attribute mapping 
in Theorem 1 are characterized by the following set 

$$\mathit{smin}(\mathit{filter}(\mathit{diss}(a, S), \mathit{comm}(a, S))) \qquad (1)$$


Let $R$ be a set of representatives of $\mathit{bundles}(a, S)$, i.e. 
$\forall b \in \mathit{bundles}(a, S)\ \exists! s \in R\colon s \in b$. The following 
theorem states that the set 

$$\mathit{smin}(\mathit{diss}(a, R|_{\mathit{comm}(a, S)})) \qquad (2)$$


contains the same index sets as (1). Expression (2) sug- 
gests, however, a smaller search space than (1), since the 
diss function is computed only over a restricted set of 
indexes and a subset of the dump set. 


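Theorem 2. Let $a \in A$ be an attribute and $S \subseteq \mathbb{B}^n$ 
a dump set. Let $R$ be a set of representatives 
of $\mathit{bundles}(a, S)$. Then 
$\mathit{smin}(\mathit{diss}(a, R|_{\mathit{comm}(a, S)})) = \mathit{smin}(\mathit{filter}(\mathit{diss}(a, S), \mathit{comm}(a, S)))$. 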














To build up our intuition, we first formulate the lemma 
that by expanding a dump set we might be able to locate 
an attribute more precisely. 














Lemma 4. Let $S, S' \subseteq \mathbb{B}^n$ be dump sets and $a \in A$ an 
attribute. Then $S' \subseteq S \implies \mathit{diss}(a, S') \supseteq \mathit{diss}(a, S)$. 


The preceding lemma indicates in particular that a 
dump set contains more information about an attribute 
than its subset of representatives. If we filter the 
$\mathit{diss}(a, S)$ sets with respect to the $\mathit{comm}(a, S)$ set, however, 
then the representatives are sufficient. 

Lemma 5. Let $S \subseteq \mathbb{B}^n$ be a dump set and $a \in A$ 
an attribute. Let $R$ be a set of representatives of 
$\mathit{bundles}(a, S)$. Then 
$\mathit{filter}(\mathit{diss}(a, S), \mathit{comm}(a, S)) = \mathit{filter}(\mathit{diss}(a, R), \mathit{comm}(a, S))$. 


The filter with respect to the comm(a, S) set in the 
preceding Lemma is indeed necessary. In general, the 
set diss(a, R) does not coincide with diss(a, S). 

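Consider, for instance, the three two-bit dumps $s_1 = 01$, 
$s_2 = 00$, and $s_3 = 11$. Suppose the dumps encode 
the attribute $a$ with $\mathit{val}_a(s_1) = \mathit{val}_a(s_2) = A$ and 
$\mathit{val}_a(s_3) = B$. Then we have the following bundles and 
dissimilarity sets. 

$$\begin{aligned}
\mathit{bundles}(a, \{s_1, s_2, s_3\}) &= \{\{s_1, s_2\}, \{s_3\}\} \\
\mathit{diss}(a, \{s_1, s_2, s_3\}) &= \{\{0\}, \{0, 1\}\} = \overline{\{\{0\}\}} \\
\mathit{diss}(a, \{s_2, s_3\}) &= \{\{0\}, \{1\}, \{0, 1\}\} = \overline{\{\{0\}, \{1\}\}}
\end{aligned}$$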


Thus, in spite of the fact that $s_1$ and $s_2$ have a common 
value for the attribute $a$, considering both in the 
dissimilarity set provides more information. 
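
The two dissimilarity sets of this example can be reproduced by exhaustive enumeration. The sketch below is exponential in the dump length and is meant only for tiny examples such as this one; the function names are illustrative.

```python
from itertools import combinations


def diss(dumps, n):
    """All index sets within [0, n) that separate every pair of dumps with
    different attribute values. Exponential in n; for tiny examples only."""
    result = set()
    for size in range(n + 1):
        for index_set in combinations(range(n), size):
            if all(
                v1 == v2 or any(b1[i] != b2[i] for i in index_set)
                for (b1, v1), (b2, v2) in combinations(dumps, 2)
            ):
                result.add(frozenset(index_set))
    return result


def smin(collection):
    """Members of the collection that contain no other member as a strict subset."""
    return {p for p in collection if not any(q < p for q in collection)}


s1, s2, s3 = ("01", "A"), ("00", "A"), ("11", "B")
full = diss([s1, s2, s3], 2)
representatives_only = diss([s2, s3], 2)
```

With all three dumps, index 0 is the only minimal separating set; with the representatives $s_2$ and $s_3$ alone, index 1 also qualifies, which is exactly the loss of information described above.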

Finally, if we assume that the sizes of the attribute 
value domains are known, we have an information- 
theoretic lower bound on the number of bits that must 
have been used for encoding the attribute. This is ex- 
pressed in the following lemma, which can be used to 
further limit the search space. The lemma follows from 
the pigeonhole principle. 


Lemma 6. Let $A$ be an attribute set and let $f$ be an 
attribute mapping for dump set $S \subseteq \mathbb{B}^n$, then 
$\forall a \in A\colon |f(a)| \ge \log_2(|\mathit{type}(a)|)$. 
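
As a worked instance of this bound, the sketch below computes the minimum number of bits for the two running attributes. It assumes that last-used has one-second granularity and that both end dates are included; neither detail is fixed by the text.

```python
from datetime import datetime


def min_bits(domain_size: int) -> int:
    """Smallest number of bits that can injectively encode `domain_size` values,
    i.e. the ceiling of log2(domain_size), computed without floating point."""
    if domain_size < 1:
        raise ValueError("domain must be non-empty")
    return (domain_size - 1).bit_length()


rides_left_bits = min_bits(16)  # type(rides-left) = [0, 15]

# Assumed domain: every second from 1/1/2000 up to and including 1/1/2050.
span = datetime(2050, 1, 1) - datetime(2000, 1, 1)
last_used_bits = min_bits(int(span.total_seconds()) + 1)
```

Under these assumptions rides-left needs at least 4 bits, so the 5-bit sample encoding is admissible, and last-used needs at least 31 bits.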


In Section 6, we will investigate algorithms for deter- 
mining the sets smin(diss(a, S)) and comm(a, S). 














5 Cyclic attribute mappings 


In this section we extend our results to a class of dy- 
namic mappings, which we call semi-dynamic or cyclic 
mappings. Cyclic mappings can, for instance, be used to 
store trip frames on a public transportation card. Such 
a trip frame contains all information related to a single 
ride. Trip frames are stored in one of a fixed number of 
slots in the card’s memory. When validating the card for 
a new ride, a new trip frame will be written to the next 
available slot. If all slots have been filled, the next trip 
frame will be written to the first slot again, etc. We will 
show that cyclic mappings can be detected by the same 
algorithms as static mappings at the cost of introducing 
a number of derived attributes. 

Because cyclic mappings consider the evolution of
a given object in time, we will first assume additional
structure on the dump set corresponding to the history
of an object. We assume that for each dump we can
determine to which object it belongs through the attribute
id (e.g. the unique identifier of a public transportation
card). For each object we further assume that its dumps
are ordered as expressed by an attribute seqnr.











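Definition 4. Let $S \subseteq \mathbb{B}^n$ be a dump set and let id and
seqnr be attributes. We say that the pair (id, seqnr) is a
bundle-ordering if $\mathrm{type}(\mathit{seqnr}) = \mathbb{N}$ and

$$
\forall b \in \mathrm{bundles}(\mathit{id}, S)\ \forall s, s' \in b\colon\;
s \neq s' \implies \mathrm{val}_{\mathit{seqnr}}(s) \neq \mathrm{val}_{\mathit{seqnr}}(s').
$$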


Because the combination of a device identifier and a
sequence number uniquely determines a dump, we can
consider an attribute $a$ as a function on $\mathrm{type}(\mathit{id}) \times \mathbb{N}$.
Given $i \in \mathrm{type}(\mathit{id})$ and $n \in \mathbb{N}$ we will thus write $a(i, n)$
for $\mathrm{val}_a(s)$, where $s \in S$ is the dump uniquely
determined by $\mathrm{val}_{\mathit{id}}(s) = i$ and $\mathrm{val}_{\mathit{seqnr}}(s) = n$.

Using this notation, we are now able to derive new
attributes from a given attribute $a$. In particular, we can
consider the history of a device. An example is the
attribute $a_{-1}$, which determines the $a$-value of the
direct predecessor of a dump. This attribute is defined by
$a_{-1}(i, n) = a(i, n - 1)$. It is defined on a subset of $S$,
viz.

$$
\{ s \in S \mid \exists s' \in S\colon \mathrm{val}_{\mathit{id}}(s') = \mathrm{val}_{\mathit{id}}(s) \land
\mathrm{val}_{\mathit{seqnr}}(s') = \mathrm{val}_{\mathit{seqnr}}(s) - 1 \}.
$$

This generalizes to $a_{-r}$ for $r \in \mathbb{N}$. By extending the set of
attributes with such derived attributes, we can automatically
verify if a dump contains information on the history
(i.e. the previous states) of a device.

This technique is particularly useful when dealing
with cyclic attribute mappings. A cyclic mapping of
attribute $a$ considers a number of locations to store the
value of $a$, e.g., $[i_1, j_1)$, $[i_2, j_2)$ and $[i_3, j_3)$. In the first
dump of an ordered id-bundle the value of $a$ is stored at
$[i_1, j_1)$. In the second dump $a$ is stored at $[i_2, j_2)$, etc.
The location for the fourth value of $a$ is again $[i_1, j_1)$.

In order to locate a cyclic mapping for attribute $a$,
we will derive new attributes $a_{\mathrm{cycle}(x/c)}$, where $c$ is the
length of the cycle and $x$ is a sequence number ($0 \leq x < c$).
Using notation $\lfloor r \rfloor$ for the floor of rational number
$r$, we obtain the following extensional definition of these
new attributes:

$$
a_{\mathrm{cycle}(x/c)}(i, n) = a\left(i,\; c \cdot \left\lfloor \frac{n - x}{c} \right\rfloor + x\right)
$$





In Figure 2 we show the attributes derived from the rides-left
(rl) attribute, assuming a cyclic mapping of length
3. The dumps $s_1$ to $s_6$ are consecutive dumps of a single
card. In order to find the cycle length of a cyclically
mapped attribute, it suffices to search for attributes
$a_{\mathrm{cycle}(0/c)}$, where $c$ ranges from 2 to the expected
maximum cycle length. In the figure we denote $rl_{\mathrm{cycle}(x/c)}$
by $rl_{x/c}$.





     | rl | rl_{0/3} | rl_{1/3} | rl_{2/3} | seqnr_{mod(3)}
s1   | 8  | 8        | -        | -        | 1
s2   | 7  | 8        | 7        | -        | 2
s3   | 6  | 8        | 7        | 6        | 3
s4   | 5  | 5        | 7        | 6        | 1
s5   | 4  | 5        | 4        | 6        | 2
s6   | 3  | 5        | 4        | 3        | 3


Figure 2: Derived attributes with cycle length 3. 
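
The derived columns of Figure 2 can be reproduced with a few lines of Python. This is a minimal sketch under two assumptions that are not stated in the text: sequence numbers of a card start at 0, and the derived value at dump $n$ is the value of $a$ at the most recent sequence number $m \leq n$ with $m \equiv x \pmod{c}$. The helper name `cycle_attribute` is hypothetical.

```python
def cycle_attribute(values, x, c):
    """Derived attribute a_cycle(x/c) for a single card.

    values[n] is the attribute value a(i, n) of the n-th dump of the card
    (sequence numbers are assumed to start at 0). The result holds None
    where the derived attribute is undefined.
    """
    derived = []
    for n in range(len(values)):
        src = c * ((n - x) // c) + x   # latest index <= n congruent to x mod c
        derived.append(values[src] if src >= 0 else None)
    return derived

rides_left = [8, 7, 6, 5, 4, 3]        # the rl column of Figure 2
for x in range(3):
    print(x, cycle_attribute(rides_left, x, 3))
# 0 [8, 8, 8, 5, 5, 5]
# 1 [None, 7, 7, 7, 4, 4]
# 2 [None, None, 6, 6, 6, 3]
```

Each derived attribute changes only once per cycle, which is what lets the static carving algorithms treat the slot for position $x$ as if it were a fixed location.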


We conclude our observations on cyclic mappings by
considering pointers to such attributes. An example is the
use of a pointer (at a static location), pointing at the block
in memory where the information on the most recent trip
is stored. Clearly, if the trip information is stored
alternatingly at different locations, the pointer will have a
similar cyclic behaviour. We can search for such cyclic
pointers by introducing attributes $\mathit{seqnr}_{\mathrm{mod}(c)}$, which
consider the sequence number of the dump modulo cycle
length $c$. Figure 2 contains an example for $c = 3$.


6 Algorithms 


In the following we concern ourselves with the two basic 
carving algorithms, comm and diss. 


6.1 Commonalities 


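The algorithm computing the comm function identifies
all positions in which given bitstrings have the same
value. We implement it using the function
$\mathrm{fc}\colon \mathcal{P}(\mathbb{B}^*) \times \mathcal{P}(\mathbb{N}) \to \mathcal{P}(\mathbb{N})$, which we define recursively as follows,
using the symbol $\uplus$ for the disjoint union of sets.

$$
\begin{aligned}
\mathrm{fc}(\emptyset, I) &= I \\
\mathrm{fc}(\{s\}, I) &= I \\
\mathrm{fc}(S \uplus \{s, s'\}, I) &= \mathrm{fc}(S \cup \{s\}, \{ i \in I \mid s_i = s'_i \})
\end{aligned}
$$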


Obviously, for dumps of length $n$,

$$
\mathrm{comm}(a, S) = \bigcap_{b \in \mathrm{bundles}(a, S)} \mathrm{fc}(b, [0, n)).
$$

The bit complexity of this step is $O(n \cdot |S|)$.

The function comm is illustrated in Figure 3. For 
each of the three bundles we have calculated the fc set 
as the set of all positions where all dumps from the bun- 
dle agree on the bit (indicated by the asterisk symbols). 
Finally, the comm set cm is the intersection of these fc 
sets. 
     | rl | dump
s1   | 4  | 010100100111010000
s2   | 4  | 001100100001010010
            *..******..*****.*   fc
s3   | 5  | 101110101011010100
            ******************   fc
s4   | 6  | 001010110111011011
s5   | 6  | 111010110011011001
            ..*******.******.*   fc
            ...******..*****.*   cm








Figure 3: Calculation of the comm set. 
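
The computation shown in Figure 3 can be sketched in Python as follows. This is an illustrative version, not the mCarve source: it replaces the recursive definition of fc by an equivalent comparison of every dump in a bundle against the bundle's first dump, and it represents dumps as strings of `0` and `1`.

```python
from collections import defaultdict

def fc(bundle, n):
    """Positions on which all dumps of one bundle carry the same bit."""
    first = bundle[0]
    return {i for i in range(n) if all(d[i] == first[i] for d in bundle)}

def comm(dumps, values):
    """Intersection of the fc sets of all bundles (dumps grouped by value)."""
    n = len(dumps[0])
    bundles = defaultdict(list)
    for dump, value in zip(dumps, values):
        bundles[value].append(dump)
    common = set(range(n))
    for bundle in bundles.values():
        common &= fc(bundle, n)
    return common

dumps = ["010100100111010000", "001100100001010010", "101110101011010100",
         "001010110111011011", "111010110011011001"]
rides_left = [4, 4, 5, 6, 6]
cm = comm(dumps, rides_left)
print("".join("*" if i in cm else "." for i in range(18)))
# ...******..*****.*
```

Every bit of every dump is inspected once, which matches the $O(n \cdot |S|)$ bound stated above.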


6.2 Dissimilarities 


Given a set of bundles, the algorithm for the diss func- 
tion identifies intervals in which any two bitstrings from 
different bundles differ in at least one position. 

We implement the diss function in the case where the
attribute mapping is assumed to be contiguous using the
dissimilarity interval function $\mathrm{iv}(a, S)(i)$. It denotes the
shortest interval that a contiguous encoding of attribute
$a$ must have if it is to start at position $i$. Such an interval
does not exist if there are dumps in $S$ with different values
for $a$ which do not differ at any position in $[i, n)$.











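Definition 5. Let $a \in A$ be an attribute and $S \subseteq \mathbb{B}^n$ be
a dump set. The dissimilarity interval function
$\mathrm{iv}(a, S)\colon [0, n) \to \mathcal{P}([0, n)) \cup \{\bot\}$ of $S$ with respect to attribute
$a$ is defined by

$$
\mathrm{iv}(a, S)(i) = \bigl[\, i,\ \min\{ k \in [i, n) \mid
\forall d, d' \in S\colon \mathrm{val}_a(d) \neq \mathrm{val}_a(d') \implies
\exists j \in [0, n)\colon (i \leq j \leq k \land d_j \neq d'_j) \} \,\bigr]
$$

if the minimum exists and $\bot$ else.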


The following lemma expresses that the dissimilarity 
set for contiguous attribute maps can be obtained from 
the dissimilarity interval function. To state the lemma, 
we first need to define subset minimality and superset 
closure for sets of intervals. 

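Let $\mathcal{I}_n = \{ [i, j] \subseteq \mathbb{N} \mid i, j < n \}$ be the set of intervals
in $[0, n)$. We define the interval-superset closure of a set
$P \subseteq \mathcal{I}_n$ by $\{ p \in \mathcal{I}_n \mid \exists p' \in P\colon p' \subseteq p \}$. It is easy
to see that the interval-superset closure of $P$ is equal to
$\overline{P} \cap \mathcal{I}_n$. A set $P$ is said to be interval-superset closed if
$P \subseteq \mathcal{I}_n$ and $P = \overline{P} \cap \mathcal{I}_n$. We say that $P$ is
interval-subset minimal if $P \subseteq \mathcal{I}_n$ and for every $p, p' \in P$,
$p' \subseteq p \implies p' = p$. It is also easy to see that for every
set of intervals $P \subseteq \mathcal{I}_n$, there is a unique interval-subset
minimal set $Q \subseteq \mathcal{I}_n$ such that $\overline{Q} \cap \mathcal{I}_n = \overline{P} \cap \mathcal{I}_n$. The
proof is analogous to the proof of Lemma 3.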
Lemma 7. Let $S \subseteq \mathbb{B}^n$ be a dump set and $a \in A$ an
attribute. Let the set $T$ be defined by

$$
T = \{ \mathrm{iv}(a, S)(i) \in \mathcal{I}_n \mid i \in [0, n) \land
\mathrm{iv}(a, S)(i + 1) \not\subseteq \mathrm{iv}(a, S)(i) \}
$$

then $T$ is the interval-subset-minimal set that satisfies
$\overline{T} \cap \mathcal{I}_n = \mathrm{diss}(a, S) \cap \mathcal{I}_n$.


To compute $\mathrm{iv}(a, S)(i)$ for $i \in [0, n)$, we assume for
simplicity of exposition that no two dumps in $S$ have the
same value for attribute $a$, that is, we are restricting
ourselves to a set of representatives $R$ of $\mathrm{bundles}(a, S)$.

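A naive algorithm for $\mathrm{iv}(a, R)(i)$ is to first compare
two dumps from $R$, then to iterate over all remaining
dumps in $R$, comparing each new dump to the first two
dumps and all dumps that have already been iterated
over. In each comparison of two dumps, the first position
from position $i$ onwards in which the two dumps differ
is sought. The maximal such position is returned.
More precisely, let $\mathrm{fiv}\colon \mathcal{P}(\mathbb{B}^*) \times \mathbb{N} \to \mathbb{N} \cup \{-\infty, \infty\}$
be defined recursively as follows. Note that we adopt
the conventions $\min(\emptyset) = \infty$, $\max(\infty, k) = \infty$, and
$\max(-\infty, k) = k$ for all $k \in \mathbb{N} \cup \{-\infty, \infty\}$.

$$
\begin{aligned}
\mathrm{fiv}(\emptyset, i) &= -\infty \\
\mathrm{fiv}(\{s\}, i) &= -\infty \\
\mathrm{fiv}(\{s, s'\}, i) &= \min\{ k \in \mathbb{N} \mid k \geq i,\ s_k \neq s'_k \} \\
\mathrm{fiv}(R \uplus \{s\}, i) &= \max\bigl(\mathrm{fiv}(R, i),\ \max_{s' \in R}\{\mathrm{fiv}(\{s, s'\}, i)\}\bigr)
\end{aligned}
$$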
Then for any set $R$ of representatives from
$\mathrm{bundles}(a, S)$, we have $\mathrm{iv}(a, R)(i) = [i, \mathrm{fiv}(R, i)]$
if $\mathrm{fiv}(R, i) \in \mathbb{N}$ and $\mathrm{iv}(a, R)(i) = \bot$ else. The
number of comparisons of two dumps, i.e. the number
of calls to $\mathrm{fiv}(\{s, s'\}, i)$, is easily seen to be
quadratic in $|R|$. We can improve the number of
comparisons to $O(|R| \log |R|)$ by sorting the set of
dumps first. We will write $s <_i s'$ if and only if
$\exists j \in [i, n)\colon s_j < s'_j \land \forall i \leq k < j\colon s_k = s'_k$. We will
write $s \leq_i s'$ if $s <_i s'$ or $\forall j \in [i, n)\colon s_j = s'_j$.

A more efficient algorithm A to compute iv(a, R)(i) 
runs as follows. 





1. Sort the dump set $R$ in ascending order with respect
to $\leq_i$. Let $s^{(1)} \leq_i s^{(2)} \leq_i \cdots \leq_i s^{(|R|)}$ be the
sorted list of these dumps.


2. For $j$ from 1 to $|R| - 1$, compare $s^{(j)}$ with $s^{(j+1)}$.
For the comparison, start with the $i$-th bit and move
towards the $(n-1)$-st bit. Let $k_j$ be the index of the
first bit in which $s^{(j)}$ differs from $s^{(j+1)}$. If no such
bit exists, output $\bot$ and stop.


3. Output the interval $[i, \max_{j \in [1, |R|)}(k_j)]$.
Theorem 3. Let $S \subseteq \mathbb{B}^n$ be a dump set and $a \in A$
an attribute. Let $R$ be a set of representatives of the
sets in $\mathrm{bundles}(a, S)$. Then the set $T$ with
$\overline{T} \cap \mathcal{I}_n = \mathrm{diss}(a, R) \cap \mathcal{I}_n$ is computed by $\mathcal{A}$ in time
$O(n^2 |R| + n |R| \log |R|)$.


The calculation of the diss set is explained in Figure 4.
We start by taking a representative of each of the bundles.
Then, starting from the left, we calculate for each position
how far to the right we must go in order to find a
distinguishing bit for each pair of dumps. For position 0
the first two bits already make a distinction between the
three dumps, which gives the interval [0, 1] (indicated by
the first line with asterisk symbols). For position 1 we
need three bits, because $s_3$ and $s_4$ coincide at positions 1
and 2. This gives the interval [1, 3], etc. Those sets
belonging to the subset-minimal diss set are marked with
“minimal”.


     | rl | dump
s1   | 4  | 010100100111010000
s3   | 5  | 101110101011010100
s4   | 6  | 001010110111011011
            **................   minimal
            .***..............
            ..**..............   minimal
            ...**.............   minimal
            ....****..........   minimal
            .....****.........
            ......***.........
            .......**.........   minimal
            etc.








Figure 4: Calculation of the diss set. 
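
The sorting-based procedure can be sketched as below. This is a hedged illustration rather than the mCarve implementation: dumps are Python strings of `0` and `1`, Python's built-in string comparison on the suffix starting at bit `i` stands in for the order $\leq_i$, and `None` stands in for $\bot$. Comparing only neighbours in the sorted list suffices because the first differing position of any two dumps is bounded by the first differing positions of the adjacent pairs between them. Treating an interval as minimal when its successor does not exist is an assumption of this sketch.

```python
def iv(reps, i):
    """Shortest interval [i, k] on which all representatives are pairwise
    distinct, or None if two of them coincide on [i, n)."""
    n = len(reps[0])
    ordered = sorted(reps, key=lambda s: s[i:])      # order by bits i..n-1
    k = i
    for s, t in zip(ordered, ordered[1:]):
        diff = next((j for j in range(i, n) if s[j] != t[j]), None)
        if diff is None:
            return None
        k = max(k, diff)
    return (i, k)

def minimal_intervals(reps):
    """Intervals iv(i) that do not contain the interval starting at i + 1."""
    n = len(reps[0])
    ivs = [iv(reps, i) for i in range(n)] + [None]
    return [cur for cur, nxt in zip(ivs, ivs[1:])
            if cur is not None and (nxt is None or nxt[1] > cur[1])]

reps = ["010100100111010000",   # s1, rl = 4
        "101110101011010100",   # s3, rl = 5
        "001010110111011011"]   # s4, rl = 6
print(iv(reps, 0), iv(reps, 1))      # (0, 1) (1, 3)
print(minimal_intervals(reps)[:5])   # [(0, 1), (2, 3), (3, 4), (4, 7), (7, 8)]
```

The first eight starting positions reproduce the rows of Figure 4, including which of them are marked as minimal.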


If we combine the comm set from Figure 3 and the
diss set from Figure 4, under the assumption that the
number of rides is encoded with 4 bits, we obtain the
four remaining possibilities from Figure 5. This result
includes the three possible attribute mappings from
Figure 1.


rl | dump 

010100100111010000 
001100100001010010 
101110101011010100 
001010110111011011 
111010110011011001 





Figure 5: The resulting attribute mappings. 
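
For a fixed encoding width the combination of both results can also be checked directly. The sketch below is an assumption-laden shortcut, not the combined procedure of the tool: instead of intersecting precomputed comm and diss sets, it tests each 4-bit window for the two equivalent conditions, namely that dumps with the same value agree on the window and that dumps with different values differ on it. The function name `candidate_intervals` is hypothetical.

```python
from collections import defaultdict

def candidate_intervals(dumps, values, width):
    """Contiguous windows of the given width that are constant within each
    bundle and pairwise distinct across bundles."""
    n = len(dumps[0])
    found = []
    for i in range(n - width + 1):
        seen = defaultdict(set)               # attribute value -> bit patterns
        for dump, value in zip(dumps, values):
            seen[value].add(dump[i:i + width])
        consistent = all(len(p) == 1 for p in seen.values())
        patterns = [next(iter(p)) for p in seen.values()]
        distinct = len(set(patterns)) == len(patterns)
        if consistent and distinct:
            found.append((i, i + width - 1))
    return found

dumps = ["010100100111010000", "001100100001010010", "101110101011010100",
         "001010110111011011", "111010110011011001"]
rides_left = [4, 4, 5, 6, 6]
print(candidate_intervals(dumps, rides_left, 4))
# [(3, 6), (4, 7), (5, 8), (12, 15)]
```

Four windows survive, in line with the four possibilities mentioned in the text. Reading bits 5 to 8 as a binary number yields 4, 5 and 6 for the three bundles.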
7 The mCarve tool


We have implemented the algorithms of Section 6 in a 
prototype called mCarve [12]. The prototype allows the 
forensic analyst to input a collection of dumps and a col- 
lection of attributes. Each of the dumps can be accompa- 
nied by its attribute values. The prototype was written in 
Python and consists of approximately 1200 lines of code 
(excluding graphical user interface). 

After entering the dumps and attributes the user can
run the commonalities algorithm for an attribute. The
output of the algorithm is the set of indexes $I$ at which
all dumps with the same attribute value agree. The
set $I$ is used as a coloring mask to display any dump $d$
selected by the user: if $i \in I$, then $d_i$ is colored blue,
otherwise red. The dissimilarities algorithm computes a
subset-minimal set of dissimilarity intervals. Since these
intervals may be overlapping, the prototype enumerates
them rather than showing them as one coloring mask.
This allows the user to step through the intervals. The
prototype displays the interval $iv$ by applying a yellow
coloring mask to all bits $d_i$ for $i \in iv$. A combined
procedure consolidates the results from the commonalities
and dissimilarities algorithms.

The prototype further allows users to specify two types 
of special attributes: a constant attribute and a hash at- 
tribute. The former has a constant value for all dumps 
and can be used to determine which bits never change. 
The latter has a different value for all pairwise differ- 
ent dumps and can be used to detect encrypted attributes. 
The tool allows one to derive new attributes from other 
attributes. These derived attributes can be used to find 
cyclic attribute mappings. The tool further allows one to 
apply an encoding to a selected interval in each dump. A 
number of standard encodings, such as ASCII and base 
10, are implemented. Aside from displaying the output
onscreen, the user can choose to export the results
to JPEG or to LaTeX (see Figure 7 for an example).


7.1 Performance 


We illustrate the performance of our prototype by running
it on a generated test suite. The test
suite consists of dumps of sizes 8KB, 16KB, 32KB,
64KB, 128KB, and 256KB. For each file size, 5 dump
sets were generated. Each dump embeds one attribute,
encoded in at most 64 bits, at a random position. The
remaining bits are randomly generated.

The running time of the commonalities procedure is 
linear in the number of dumps and the dissimilarities pro- 
cedure is quadratic in the number of bundles. Therefore, 
the execution time of the combined procedure is mainly 
dependent on the number of bundles in the dump set. 
Convergence tests show that, in general, fewer than 10 
bundles are needed to find an attribute in a dump set.
This allows us to restrict our performance tests to dump 
sets of 10 bundles. 


[Figure 6 plot: time (s) on the vertical axis against the number of bundles on the horizontal axis; legend entries: 256 KB, 128 KB, 64 KB, 32 KB, 16 KB.]


Figure 6: Performance 


The tests were run on a Linux machine (kernel 2.6.31- 
22) with Intel Core 2 6400 @ 2.13 GHz processor run- 
ning Python 2.6.4. Figure 6 shows on the horizontal axis 
the number of bundles included in the dump set. On the 
vertical axis it shows for each of the file sizes the time in 
seconds (averaged over the 5 dump sets) needed to per- 
form the combined procedure. The test shows that our 
prototype is best suited for dumps of size smaller than 
32KB, but it can deal reasonably well with size up to 
256KB. Initial experiments have shown that performance 
of the tool can be significantly improved by implement- 
ing the core procedures in a lower-level language. 


7.2 Convergence 


Another interesting measure for the mCarve tool is the 
rate of convergence of the carved intervals. We will mea- 
sure it by computing the number of dumps that are nec- 
essary in order to find an attribute in a dump set. For sim- 
plicity, we assume that the dumps as well as the attribute 
values are given by a uniformly random distribution. 


Let $q$ denote the bit length of the attribute's encoding
in the dump, let $N$ denote the number of dumps and let $x$
be the number of bundles. We first compute the probability
of false positives, i.e. the probability of an accidental
occurrence of values matching an attribute. The probability
that the bit string formed by a particular interval of
$q$ bits in all $N$ dumps matches a particular given string
of bits is $2^{-qN}$. There are $\binom{2^q}{x} \cdot x!$ possible encodings
of $x$ different values. The probability that the $q$ bits in all
$N$ dumps match one of these representations is therefore
$2^{-qN} \binom{2^q}{x}\, x!$.

Thus if $l$ denotes the length of the bit strings representing
dumps, then the probability $p_{\mathrm{nfp}}$ of no false positives
is given by

$$
p_{\mathrm{nfp}} \geq \left( 1 - \frac{2^{-qN}\,(2^q)!}{(2^q - x)!} \right)^{l - q + 1}.
$$

The inequality is due to the fact that the product on the
right does not concern independent trials. We are interested
in those values of $x$ and $N$ for which the probability
$p_{\mathrm{nfp}}$ is large enough that the discovery of an attribute is
not coincidental.

Using the inequality $\binom{n}{k} \leq \frac{n^k}{k!}$ we obtain

$$
p_{\mathrm{nfp}} \geq \left( 1 - \frac{2^{-qN}\,(2^q)!}{(2^q - x)!} \right)^{l - q + 1}
\geq \left( 1 - 2^{-q(N - x)} \right)^{l - q + 1}.
$$

Fixing the number of bundles $x$ and a false positive
probability $\epsilon$, we obtain the following inequality for the
number of dumps $N$:

$$
\left( 1 - 2^{-q(N - x)} \right)^{l - q + 1} \geq 1 - \epsilon.
$$

Thus

$$
N > -\frac{1}{q} \log_2\left( 1 - (1 - \epsilon)^{\frac{1}{l - q + 1}} \right) + x.
$$

This formula can be used in two ways. If we know the
length $q$ of the encoding, we fix a number of bundles $x$
and a false positive probability $\epsilon$ and compute the number
of dumps $N$ needed for convergence. If we do not
know the length $q$, we set it to $\log_2(x)$ and perform
the same computation. For instance, for dumps of length
$l = 1024$, false positive probability of $\epsilon = 0.05$, number
of bundles $x = 4$, and length $q = \log_2(x) = 2$ we get
$N > 11.14$. This means that to have convergence with
probability 0.95 we need to analyze 12 dumps comprising
4 different attribute values.
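
The bound on the number of dumps is easy to evaluate numerically. The following sketch computes $-\frac{1}{q}\log_2\bigl(1 - (1-\epsilon)^{1/(l-q+1)}\bigr) + x$ for the example values from the text; the function name `dumps_needed` is a hypothetical helper, not part of the tool.

```python
import math

def dumps_needed(l, q, x, eps):
    """Lower bound on the number of dumps N for dump length l (bits),
    encoding length q (bits), x bundles and false positive probability eps."""
    windows = l - q + 1                       # possible start positions
    return -math.log2(1 - (1 - eps) ** (1 / windows)) / q + x

bound = dumps_needed(l=1024, q=2, x=4, eps=0.05)
print(round(bound, 2), math.ceil(bound))      # 11.14 12
```

Because the bound grows only logarithmically in the dump length, doubling $l$ adds roughly $1/(2q)$ dumps for a fixed $\epsilon$.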


8 Case study: The E-go system 


We illustrate our methodology by reverse engineering 
part of the memory structure of the Luxembourg public 
transportation card. 


8.1 The E-go system 


The fare collection system for public transportation in 
Luxembourg, called e-go, is based on radio frequency 
identification (RFID) technology. The RFID system con- 
sists of credit-card shaped RFID tags that communicate 
wirelessly with RFID readers. Readers communicate 
with a central back-end system to synchronize their data. 
Travelers can buy e-go cards with, for instance, a book 
of 10 tickets loaded on it. Upon entering a bus, the user 
swipes his e-go card across a reader and a ticket is
removed from the card.

Since most RFID readers of the e-go system are de- 
ployed in buses the e-go is an off-line RFID system [5]. 
Readers do not maintain a permanent connection with 
the back-end system, but synchronize their data only in- 
frequently. Since readers may have data that is out-of- 
date and tags may communicate with multiple readers, 
the e-go system has to store information on the card. 

The RFID tags used for the e-go system are, in fact,
MIFARE Classic 1k tags. These tags have 16 sectors that
each contain 64 bytes of data, totaling 1 kilobyte of memory.
Sector keys are needed to access the data of each
sector. Garcia et al. [4, 6] recently showed that these
keys can be efficiently obtained with off-the-shelf hardware.
Therefore, it is easy to create a memory dump of
an e-go card.


8.2 Data collection


Over a period of 2 months, we collected 68 dumps for
7 different e-go cards of different types. Four cards are
of type 10-rides/2nd-class, two of type 1-ride/2nd-class
and one of type 1-ride/1st-class. According to information
published by the transportation companies, a card
can contain up to 6 products of the same type. We considered
two classes of events that change the state of a
card: (1) charging the card with a new product (including
the purchase of a new, charged card), and (2) validating
a ride by swiping the card. After each event we dumped
the memory of the card as a binary file. This gave a
sequence of consecutive events for each card.

Because the e-go system is an off-line system, we ex- 
pected to find several attributes encoded on the card. For 
each event we therefore collected some contextual in- 
formation, which we attributed to the dump following 
the event. For charge events we collected the following 
attributes: card id (the decimal number printed on the 
card); charged product; date, time and location of charg- 
ing; card charger id (as printed on the coupon). For val- 
idation events we collected: card id; date and time of 
swiping; expiration time of the ride; card reader id (be- 
cause the card readers have no visible identification we 
collected the license plate number of the bus and the lo- 
cation of the reader within the bus); rides left; bus num- 
ber; bus stop. 

These are the attributes that one would expect to find 
on the card and that are easy to observe. Most of these 
attributes can be obtained by reading the sales slips or 
the display of the reader. Since cards are purchased 
anonymously, no personal identifying information, such 
as name, address, or date-of-birth can be stored on the 
card. 

In addition to our basic set of dumps, we had access
to 47 dumps from earlier experiments which were less
structured and less documented. We used these dumps to
validate the results of the experiments with our main set
of dumps.

[Figure 7 image: map of the card's memory showing, from top to bottom, the shell sector, the product sectors, the transaction sectors, and the empty sectors. Legend: constant 0, constant 1, variant.]

Figure 7: E-go memory layout (applying common to a unique attribute).

It is important to note that our analysis is entirely pas- 
sive: no data on the card needs to be modified and no 
data needs to be written to the card. 


8.3 Data analysis 


Using our tools, we verified the presence of three classes 
of attributes: (1) external attributes (i.e., the observable 
attributes mentioned above); (2) internal attributes (re- 
lated to the organization of the data within the card’s 
memory, such as a pointer to the active sector); and (3) 
attributes with high entropy (such as CRCs and crypto- 
graphic checksums). We also searched for cyclic ver- 
sions of these attributes. 


Memory layout. The first step in our analysis is to de- 
termine the general memory layout of an e-go card. For 
this purpose we apply the commonalities algorithm to 
the constant attribute, i.e., an attribute that has a con- 
stant value for every dump. The result of this operation is 
shown in Figure 7. The card’s memory is displayed in 64 
lines of 128 bits, giving a total of 8192 bits (1kB). Bits 
that have a constant value in all dumps are colored dif- 
ferently from bits that vary in value. The recurring struc- 
tures immediately suggest a partitioning of the memory 
into 16 sectors of 4 lines each. There seem to be four dif- 
ferent types of sectors. The structure of the first sector is 
unique. We call this sector the shell sector. Lines 2 and
3 of the shell sector are identical. Next there are seven 
sectors with a similar appearance (three of these look a 
bit less dense than the others because they are used less 
frequently in our dump set). We call these sectors the 
product sectors. The next five sectors are similar. We call 
them transaction sectors. Finally, there are three empty 
sectors, which we will ignore for the rest of our analysis. 
They are probably reserved for future extensions of the 
e-go system. 

Further inspection shows that the last line of each sector
is constant (over all dumps). This is the 16-byte sector
key. Because the last lines of each of the sectors (except
the empty sectors) are equal, we can conclude that the
same key is used for all sectors.¹


External attributes. The second step in our analysis is 
to carve the external attributes. This step only revealed 
the card ID. We can conclude that the other external at- 
tributes are either not represented on the card or not at 
a static location. Figure 8 shows for each sector type 
which attributes were discovered with our tool. The card 
ID, which is located in the shell sector in Figure 8, is de- 
tected as follows. The output of our tool on the card ID 
attribute consists of a number of intervals between bits 0 
to 37 plus the interval 35 to 108. Clearly, the last interval 
is too large to contain the card ID, so we can consider that 
interval a false positive. We conclude that bits 0 to 37 are
related to the card ID. Indeed, the MIFARE standard
describes that identification numbers are hard-coded in the
first 32 bits (4 bytes). If we reverse these 4 bytes and
interpret them as a decimal number, we obtain the number
printed on the card. The fact that bits 32 to 37 relate to
the card ID is also consistent with the MIFARE standard
because bits 32 to 39 contain the checksum of the card
ID.

¹ In order to not reveal sensitive data, we display keys that are
different from those used in the e-go system.

Figure 8: Attributes located in the three sector types.
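
Reversing the first four bytes of a dump and reading them as a number amounts to a little-endian integer conversion. The sketch below illustrates this on made-up data; the byte values are purely hypothetical and do not come from any e-go card, and `card_number` is an illustrative helper name.

```python
def card_number(dump: bytes) -> int:
    """Interpret the first 4 bytes of a dump in reversed byte order
    (little-endian) as an unsigned integer."""
    return int.from_bytes(dump[:4], byteorder="little")

# Hypothetical 16-byte block whose first four bytes are d2 02 96 49.
example = bytes.fromhex("d2029649") + bytes(12)
print(card_number(example))   # 1234567890
```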


Internal attributes. The tool can be used to step 
through a sequence of dumps and observe the changes 
between consecutive dumps. In this way, one can step 
through the “history” of a particular card and observe re- 
curring patterns. This process indicates a periodicity in 
the updates of the transaction sectors of the e-go card. 
Successive validation events write to successive transac- 
tion sectors, thereby cycling back from the fifth transac- 
tion sector to the first. One would expect a similar pe- 
riodicity in the product sectors, but that is not the case. 
Writing to the product sectors occurs in an alternating 
way between two selected sectors. Based on the hypoth- 
esis that there is a notion of a “current” sector, we carve 
for pointers with cycle lengths 2 to 7. By making a selec- 
tion of those sequences of dumps that showed the cyclic 
behaviour, we can locate a pointer to the currently active 
transaction sector (see tsec-ptr in the shell sector of Fig- 
ure 8). This 3-bit pointer has a cycle of length 5 from 000 
to 100. In a similar way one obtains a pointer with cycle 
2, located at bit 169. Inspection of dumps reveals that 
this concerns a 3-bit pointer to the next active product 
sector (next-psec-ptr, bits 168-170). Two other pointers
with cycle 2 are only revealed when carving well cho- 
sen subsets of the collection of dumps. In the figure they 
are labelled with psec-ptr-A and psec-ptr-B. When step- 
ping through the dumps, it becomes clear that after each 
validation event the values of next-psec-ptr and one of 
psec-ptr-A or psec-ptr-B are swapped. When charging 
the card, psec-ptr-A and psec-ptr-B change roles. 


Cyclic external attributes. After having been able to 
locate only a single static external attribute, we continue 
by searching for dynamically stored external attributes. 
By using cycle length 5, we can find two locations in
each of the transaction sectors related to the date of the
most recent validation. In Figure 8 these locations are
labelled with “date” and “date 2”. By stepping through a
sequence of dumps swiped on consecutive days, it be- 
comes clear that the date field is a counter. It counts the 
number of days since 1/1/1997. In our dump set the two 
dates are always identical. In a similar way we can find 
two fields related to the time of the most recent validation 
event. They count the number of minutes since midnight. 
The first and second time are different, but, surprisingly, 
their difference is not constant, which would have indi- 
cated a relation to the expiration time. The last attribute 
that can be located in the transaction sector is the reader 
ID. As explained, we use the license plate of the bus and 
the location of the reader within the bus to identify each 
card reader. By combining these two attributes we obtain 
a new attribute that relates to the reader ID. Surprisingly, 
this new attribute does not occur in the dumps, but the
license plate attribute does. This means that all readers
in a given bus have the same ID. When interpreting the
reader ID as a decimal number, one typically obtains
numbers in the range from 1 to 150 for readers in a bus and
from 10150 to 10200 for readers in a train station. This
is consistent with carving for the attribute “bus-or-train”,
which points at the higher bits of the reader ID.
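
The date and time counters described above can be decoded as follows. This is a minimal sketch under an assumption the text does not settle: a counter value of 0 is taken to mean 1 January 1997 itself. The helper names are hypothetical.

```python
from datetime import date, timedelta

EPOCH = date(1997, 1, 1)   # assumption: counter value 0 maps to this day

def decode_date(days: int) -> date:
    """Day counter -> calendar date."""
    return EPOCH + timedelta(days=days)

def decode_time(minutes: int) -> tuple:
    """Minutes since midnight -> (hour, minute)."""
    return divmod(minutes, 60)

print(decode_date(365))    # 1998-01-01
print(decode_time(615))    # (10, 15)
```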

These attributes were found by reducing cyclic at- 
tributes to static attributes as described in Section 5. With 
this approach an attribute of cycle 5 will change its value 
only every 5 dumps. As a consequence, this attribute has 
a rather slow convergence rate. Convergence can be im- 
proved, however, by focusing on the active transaction 
sector. In order to do this we created a new set of dumps, 
each of which only contained the active transaction sec- 
tor of the old dump. Carving for the static external at- 
tributes in this new set of dumps results in the same find- 
ings, but the attributes can be located with significantly 
fewer dumps. 

Using this approach we can easily locate three more 
attributes in the product sectors: the card type, the num- 
ber of rides left on the card and the expiration time of 
the current product. A second field related to the number 
of rides left was also located (rides left 2 in the figure), 
which equals 12 minus rides left for 10-rides cards and 3 
minus rides left for 1-ride cards. 


Finding high entropy attributes. While using the 
tool, one quickly observes that the diss function returns 
intervals of varying widths sliding through the index set 
of the dumps. Heuristically, one expects the width of 
these sliding windows to be shorter over intervals cor- 
responding to high-entropy attributes than over indexes 
corresponding to low entropy attributes. Furthermore, 
the step size or distance between two such windows is 
expected to be smaller for high-entropy intervals. 

The observation of short-step narrow sliding windows 
led to the conjecture that the cards contain cryptographic 
data. 

To confirm the existence of high-entropy attributes, the
MD5 hash of the dumps was computed and added as an
attribute. The hash serves as a quick indicator for equality
or inequality of two dumps and is a more robust approach
to labeling distinct dumps with different attribute
values than simply enumerating all dumps in a set. Carving
for this artificial MD5 attribute amounts to looking
for attribute values which change whenever the contents
of the dump change. The tool thus revealed an 80-bit
string in the shell sector. The same method applied to
dumps of the product and transaction sectors revealed
16-bit strings which only change when the data in the
corresponding sector changes.
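
Labeling dumps by their hash takes only a few lines with Python's standard `hashlib` module. The sketch below shows the idea on made-up byte strings; `hash_attribute` is a hypothetical helper, not a function of the tool.

```python
import hashlib

def hash_attribute(dumps):
    """Attribute value per dump: identical dumps get identical labels,
    different dumps get different labels (barring hash collisions)."""
    return [hashlib.md5(d).hexdigest() for d in dumps]

# Made-up dumps: the first and third are identical.
labels = hash_attribute([b"\x00\x01\x02", b"\x00\x01\x03", b"\x00\x01\x02"])
print(labels[0] == labels[2], labels[0] == labels[1])   # True False
```

The same helper can be applied to a slice of each dump, such as a single sector, to obtain an attribute that changes exactly when that sector changes.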

Whereas an 80-bit string was expected to be a
cryptographic hash, the 16-bit strings were suspected to be
checksums such as CRCs. By applying a list of
commonly used CRCs to the data in the product and
transaction sectors, the CRC-16-ANSI with polynomial
$x^{16} + x^{15} + x^2 + 1$ was found to produce the observed values.
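
A bitwise CRC with the polynomial $x^{16} + x^{15} + x^2 + 1$ can be sketched as below. The text does not state the initial value, the bit order, or which bytes of a sector are covered, so this sketch assumes the common reflected variant with initial value 0 (often catalogued as CRC-16/ARC); other parameter choices may be needed to match actual card data.

```python
def crc16_ansi(data: bytes, init: int = 0x0000) -> int:
    """Reflected CRC-16 with polynomial x^16 + x^15 + x^2 + 1.

    0xA001 is the bit-reversed form of 0x8005. The initial value and the
    bit order are assumptions of this sketch.
    """
    crc = init
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ 0xA001 if crc & 1 else crc >> 1
    return crc

print(hex(crc16_ansi(b"123456789")))   # 0xbb3d
```

Trying a list of candidate CRCs then reduces to looping over parameter sets and comparing each result with the 16-bit string carved from the sector.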

This step led to the suspicion that a CRC might also
be part of the 80-bit string in the shell sector, which was
indeed found to be the case. The remaining 64 bits are
expected to be a cryptographic hash protecting the
integrity of the card’s data.


Evaluation. Our tool performed quite well in this case 
study. We located the attributes as displayed in Figure 8 
and have been able to infer the encoding scheme for most 
of them. On the other hand, we have not been able to lo- 
cate all collected attributes. We did not find the date, 
time and location of charging, the card charger id, the 
bus number and the bus stop. Our experiments prove that 
they are not stored in a static or cyclic way on the card. 
We may assume that if the date and time of charging and 
the card charger id were represented in the card’s mem- 
ory, they would have been encoded in the same way as 
the other dates, times and ids. A search of these encoded 
values in the binary dumps did not give a hit. There- 
fore, we conjecture that these attributes are not stored on 
the card, not even at a dynamically determined location. 
Given that a validated ride allows for unrestricted travel 
through the whole country for two hours, there is also no 
need to store the bus number and bus stop on the card. 

As a consequence of carving for internal attributes we 
have not only located four pointers, but we have also re- 
verse engineered part of the dynamics of updating e-go 
cards. The transaction sectors are written to cyclically. 
They contain data related to the history of the card. The 
current state of each of the products on the card is stored 
in the product sectors. Every product is assigned to one 
sector, except the currently active product. This product 
is updated alternatingly in two sectors. This redundancy 
is probably built in to keep a consistent product state even 
if a transaction does not finish successfully. 

More safeguards against update errors are found in the 
frequent checksums that we have been able to locate. A 
protection against intentional modification of the stored 
data is the cryptographic seal in the shell sector. 

Even though we found the majority of observed at- 
tributes, there are still locations in the card’s memory 
that we have not been able to assign a meaning to. Of 
course, the current dump set provides no information on 
the meaning of the constant (blue) bits in Figure 8. The 
variant (red) bits either have to do with the internal orga- 
nization of the card or with attributes that we did not or 
could not observe. 

With respect to convergence, we see that the dumps in 
this case study behave slightly worse than the dumps in 
the idealized set from Section 7.2. Finding an attribute
requires roughly 12 dumps (or 5 bundles). 

Occasionally, we incorrectly entered an attribute 
value. The algorithms that we developed are not robust 
against such mistakes, since a single modification in the 
input can drastically change the output. In practice, how- 
ever, such mistakes were quickly identified by regularly 
performing experiments on a subset of the dump set, such 
as all dumps belonging to a given card. 

A very useful feature of our methodology is that in the 
search for an attribute we do not presuppose a particular 
encoding of that attribute. This allowed us to search for 
the combination of license plate number and reader loca- 
tion in order to find the reader ID. Similarly, we found a 
rides left counter counting down and one that counts up 
while searching for one attribute. 


9 Conclusion and future work 


We have defined the carving problem for attributed dump 
sets as the problem of recovering the attribute mapping 
and encoding of attributes in a dump. We have pro- 
posed algorithms for recovering the attribute mapping 
and proven their correctness. The first algorithm com- 
putes the commonalities to determine the positions in a 
dump that cannot be contained in the mapping. The sec- 
ond algorithm computes subset-minimal dissimilarities 
to give a lower-bound on the bits that need to be con- 
tained in the attribute mapping. By combining these two 
algorithms, a set of possible mappings is derived. 
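
The two notions can be illustrated with a naive, pairwise Python sketch. It is not the mCarve implementation, and it makes no attempt at efficiency; the function names and the representation of a dump set as a list of (bit string, attribute value) pairs are assumptions made here. The first function computes the positions on which all dumps with equal attribute value agree, and the second checks whether a given index set distinguishes every pair of dumps with different attribute values.

```python
from itertools import combinations


def comm(dumps):
    """Indices on which dumps with the same attribute value always agree.

    `dumps` is a list of (bits, value) pairs; `bits` are equal-length
    strings over '0' and '1'.
    """
    common = set(range(len(dumps[0][0])))
    for (s, v), (t, w) in combinations(dumps, 2):
        if v == w:
            common -= {i for i in common if s[i] != t[i]}
    return common


def in_diss(index_set, dumps):
    """True if every pair of dumps with different attribute values
    differs on at least one index in `index_set`."""
    for (s, v), (t, w) in combinations(dumps, 2):
        if v != w and all(s[i] == t[i] for i in index_set):
            return False
    return True
```

A candidate mapping for the attribute is then an index set that lies inside the result of `comm` and passes `in_diss`.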

In order to validate our approach we have imple- 
mented a prototype, called mCarve, with commonality 
and dissimilarity algorithms. A case study performed on 
data from the electronic fare collection system in Luxem- 
bourg showed that mCarve is valuable in analyzing real- 
world systems. Using mCarve, we have located more 
than a dozen attributes on the e-go card as well as their 
encoding. We have also partly reverse engineered the dy- 
namics of updating e-go cards. 

There are several research directions that remain to be 
explored. To be able to understand the attribute values, 
the encoding has to be recovered as well. In our case 
study, we have recovered the encoding of attributes man- 
ually, while automatic approaches should in some cases 
be feasible. Heuristic approaches seem most viable, pos- 
sibly approaches based on file carving techniques. Sec- 
ondly, the robustness of our algorithms can be improved. 
Currently, a small error in the data, due to, for instance, a 
transmission error or a mistake in inputting the attribute 
value will make the results unreliable. Although these 
mistakes can be found by hand, an automatic way would 
be preferable. 

We would like to apply mCarve to other case studies.
An interesting application would be the memory of
a cell phone. Our performance results show that we have
to optimize the implementation of our algorithms to an- 
alyze cell phone dumps. Another use of mCarve will 
be to analyze proprietary communication protocols. By 
recording the data and applying our algorithms, we could 
reconstruct their specification. 


A Proofs


Proof of Lemma 1. We show the first property by contradiction.
Assume that there exists an attribute $a$ such
that $f(a) \not\subseteq \mathrm{comm}(a, S)$. Then there exists an index
$i \in f(a)$ such that $i \notin \mathrm{comm}(a, S)$. It follows from
the definition of comm that there is a bundle that contains
bit strings $s$ and $s'$ such that $s_i \neq s'_i$. However,
since $f$ is an attribute mapping, index $i \in f(a)$, and
$\mathrm{val}_a(s) = \mathrm{val}_a(s')$, we have that $s_i = s'_i$. Thus, $f(a)$
must be a subset of $\mathrm{comm}(a, S)$.

The second property follows from the fact that if we
extend an encoding, it remains an encoding. We know
that $e(\mathrm{val}_a(s)) = s|_{f(a)}$ is an encoding for attribute
mapping $f$. By definition of attribute mapping, the map
$e'(\mathrm{val}_a(s)) = s|_{f'(a)}$ is an encoding as long as for all
$j \in f'(a)$ we have $s_j = s'_j$ if $\mathrm{val}_a(s) = \mathrm{val}_a(s')$ and
$e'$ is injective. The former follows from the assumption
that $j \in \mathrm{comm}(a, S)$. The latter follows from the fact
that extending the range of the encoding maintains the
injectivity of it. Hence, $f'(a) \supseteq I_a$ is an attribute
mapping. $\square$


Proof of Lemma 2. Let $a \in A$ and let $s, s' \in S$, such that
$\mathrm{val}_a(s) \neq \mathrm{val}_a(s')$. From the definition of an attribute
mapping and injectivity of encoding functions, we derive
that $s|_{f(a)} \neq s'|_{f(a)}$. Therefore, we can find $i \in f(a)$,
such that $s_i \neq s'_i$, and thus $f(a)$ satisfies the definition
of $\mathrm{diss}(a, S)$. $\square$


Proof of Lemma 3. We define $Q = \{p \in P \mid \forall p' \in
P \colon p' \subseteq p \Rightarrow p' = p\}$ and prove that this is the
required set. From the definition of $Q$ it follows directly
that $Q$ is subset minimal.

The inclusion $\overline{Q} \subseteq \overline{P}$ follows directly from $Q \subseteq P$.
For the converse, $\overline{P} \subseteq \overline{Q}$, we use the fact that strict
set inclusion on $\mathcal{P}(E)$ is well-founded for finite $E$. Let
$p \in \overline{P}$, then there exists $p' \in P$, such that $p' \subseteq p$. We
consider two cases: $p' \in Q$ and $p' \notin Q$. If $p' \in Q$,
then from $p' \subseteq p$ it follows that $p \in \overline{Q}$, as required.
In the second case, $p' \notin Q$, we use the definition of
$Q$ to find $p'' \in P$ such that $p'' \subsetneq p'$. Again, we can
consider two cases: $p'' \in Q$ and $p'' \notin Q$. In the first
case, $p'' \in Q$, we have $p'' \subseteq p' \subseteq p$, so $p \in \overline{Q}$, as
required. In the second case we can repeat this construction
to find $p''' \subsetneq p'' \subsetneq p' \subseteq p$. Given well-foundedness,
it will be impossible to create an infinite
sequence in this way. Therefore, there is a point where
the loop will be broken by finding $p^{(k)} \in Q$, such that
$p^{(k)} \subsetneq p^{(k-1)} \subsetneq \ldots \subsetneq p' \subseteq p$, which implies that
$p \in \overline{Q}$.

Finally, we prove uniqueness. Assume that $X$ and $Y$
are two subset-minimal sets with $X \neq Y$ and $\overline{X} = \overline{P} =
\overline{Y}$. Without loss of generality, we may assume that there
exists $x \in X$, such that $x \notin Y$. We derive a contradiction
and conclude $X = Y$ as follows. If $x \in X$, then $x \in \overline{Y}$.
From $x \notin Y$, we find $y \in Y$, such that $y \subsetneq x$. From
$y \in Y$, it follows that $y \in \overline{X}$, so there exists $x' \in X$
with $x' \subseteq y$. Thus, we have $x' \subseteq y \subsetneq x$ for $x', x \in X$,
which contradicts the assumption of subset minimality of
$X$. $\square$
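
For a finite family of sets, the set $Q$ constructed in this proof can be computed directly. The following Python sketch is only an illustration; the names `smin` and `in_superset_closure` are chosen here and the quadratic scan is the simplest possible approach, not necessarily the one used in the prototype.

```python
def smin(family):
    """Subset-minimal elements of a finite family of sets:
    the members that have no proper subset in the family."""
    sets = {frozenset(p) for p in family}
    return {p for p in sets if not any(q < p for q in sets)}


def in_superset_closure(x, family):
    """True if `x` is a superset of some member of `family`."""
    x = frozenset(x)
    return any(frozenset(q) <= x for q in family)
```

For example, for the family {1}, {1, 2}, {2, 3}, {2, 3, 4} the function `smin` keeps only {1} and {2, 3}, and every member of the original family lies in the superset closure of that result.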


Proof of Lemma 4. By Lemma 3, let $T$ be the unique
subset-minimal set for which $\overline{T} = \mathrm{diss}(a, S)$. We show
that $T \subseteq \mathrm{diss}(a, S')$.

Let $I \in T$. Then by definition, $\forall s, s' \in
S \colon (\mathrm{val}_a(s) \neq \mathrm{val}_a(s') \Rightarrow \exists i \in I \colon s_i \neq s'_i)$. But
since $S' \subseteq S$, the statement holds in particular for any
two dumps in $S'$. Thus $I \in \mathrm{diss}(a, S')$. $\square$

Proof of Lemma 5. The inclusion
$\mathrm{filter}(\mathrm{diss}(a, S), \mathrm{comm}(a, S)) \subseteq
\mathrm{filter}(\mathrm{diss}(a, R), \mathrm{comm}(a, S))$ follows from Lemma 4.
For the reverse inclusion, let $I \in
\mathrm{filter}(\mathrm{diss}(a, R), \mathrm{comm}(a, S))$ be an index set in
the filtration of $\mathrm{diss}(a, R)$ with respect to the common
set of the attribute $a$ of the dumps in $S$.

Suppose towards a contradiction that $I \notin
\mathrm{filter}(\mathrm{diss}(a, S), \mathrm{comm}(a, S))$. Then there must
be dumps $s_1, s_2 \in S$ such that $s_1|_I = s_2|_I$, but
$\mathrm{val}_a(s_1) \neq \mathrm{val}_a(s_2)$.

Consider representatives $r_1, r_2 \in R$ of $s_1$ and $s_2$
such that $\mathrm{val}_a(r_1) = \mathrm{val}_a(s_1) \neq \mathrm{val}_a(s_2) = \mathrm{val}_a(r_2)$.
Since $I \subseteq \mathrm{comm}(a, S)$, it follows that $r_1|_I = s_1|_I$,
$r_2|_I = s_2|_I$, but $\mathrm{val}_a(r_1) \neq \mathrm{val}_a(r_2)$. This contradicts
$I \in \mathrm{diss}(a, R)$. $\square$


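Proof of Theorem 2. By Lemma 5, it suffices
to prove $\mathrm{smin}(\mathrm{diss}(a, R|_{\mathrm{comm}(a,S)})) =
\mathrm{smin}(\mathrm{filter}(\mathrm{diss}(a, R), \mathrm{comm}(a, S)))$.

The inclusion $\mathrm{smin}(\mathrm{diss}(a, R|_{\mathrm{comm}(a,S)})) \subseteq
\mathrm{filter}(\mathrm{diss}(a, R), \mathrm{comm}(a, S))$ holds, since
$\mathrm{diss}(a, R|_{\mathrm{comm}(a,S)}) \subseteq \mathrm{diss}(a, R)$ and
$\mathrm{smin}(\mathrm{diss}(a, R|_{\mathrm{comm}(a,S)})) \subseteq \mathrm{comm}(a, S)$.

The inclusion $\mathrm{diss}(a, R|_{\mathrm{comm}(a,S)}) \supseteq
\mathrm{smin}(\mathrm{filter}(\mathrm{diss}(a, R), \mathrm{comm}(a, S)))$ holds as follows.
Let $I \in \mathrm{smin}(\mathrm{filter}(\mathrm{diss}(a, R), \mathrm{comm}(a, S)))$.
Then $I \in \mathrm{diss}(a, R)$ and $I \subseteq \mathrm{comm}(a, S)$, thus
$I \in \mathrm{diss}(a, R|_{\mathrm{comm}(a,S)})$.

The theorem now follows by uniqueness of subset
minimal sets (Lemma 3) and the facts that the dissimilarity
sets and filters of dissimilarity sets are superset
closed. $\square$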


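Proof of Lemma 7. $T$ is interval-subset-minimal by definition.
It is obvious that $T \subseteq \mathrm{diss}(a, S) \cap \mathcal{I}_n$. Since
$\mathrm{diss}(a, S) \cap \mathcal{I}_n$ is interval-superset closed, it follows that
$\overline{T} \cap \mathcal{I}_n \subseteq \mathrm{diss}(a, S) \cap \mathcal{I}_n$. Furthermore, for all $i \in [0, n)$,
if $\mathrm{iv}(a, S)(i)$ exists, then $\mathrm{iv}(a, S)(i) \in \overline{T} \cap \mathcal{I}_n$.

Suppose towards a contradiction that $\overline{T} \cap \mathcal{I}_n \not\supseteq
\mathrm{diss}(a, S) \cap \mathcal{I}_n$. Then there exists $I \in \mathrm{diss}(a, S) \cap \mathcal{I}_n$
such that $I \notin \overline{T} \cap \mathcal{I}_n$. Let $I = [i_0, i_1]$ and consider
$\mathrm{iv}(a, S)(i_0)$. By definition of $\mathrm{iv}(a, S)$, we have
$\mathrm{iv}(a, S)(i_0) \subseteq I$ and we know that $\mathrm{iv}(a, S)(i_0) \in
\overline{T} \cap \mathcal{I}_n$. This contradicts $I \notin \overline{T} \cap \mathcal{I}_n$. $\square$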


Proof of Theorem 3. We first prove correctness of the algorithm
and then compute its time complexity.

Correctness. Let $k = \max_{j \in [1, |R|]}(k_j)$. By
Lemma 7, to prove correctness of the algorithm, we need
to show that for any two dumps $s, s' \in R$ there exists an
index $ind \in [i, i + k]$ such that $s_{ind} \neq s'_{ind}$. We show
this by iterating over the sorted list of dumps.

Dump $s^{(1)}$ differs from all other dumps within the
interval $[i, k_1]$ because it differs from $s^{(2)}$ within
this interval and the dump list is sorted. Assuming
that for all $j < j_0$ dump $s^{(j)}$ differs from all
other dumps within the interval $[i, \max(k_1, \ldots, k_j)]$
we show that dump $s^{(j_0)}$ differs from all other dumps
within the interval $[i, \max(k_1, \ldots, k_{j_0})]$. First, $s^{(j_0)}$ differs
from $s^{(j)} < s^{(j_0)}$ on $[i, \max(k_1, \ldots, k_{j_0})]$ since
$[i, \max(k_1, \ldots, k_j)] \subseteq [i, \max(k_1, \ldots, k_{j_0})]$. The
dumps $s^{(j)} > s^{(j_0)}$ differ from $s^{(j_0)}$ within the interval
$[i, k_{j_0}]$ because $s^{(j_0)}$ differs from $s^{(j_0+1)}$ within this
interval and the dump list is sorted. Thus the algorithm
correctly computes $\mathrm{iv}(a, R)(i)$.

Complexity. The complexity of the algorithm is given
by the complexity to sort the dump set and the complexity
to compare adjacent dumps in the sorted list.
The bit-complexity for comparing the adjacent dumps
$s^{(j)}, s^{(j+1)}$ is $k_j$. Thus, in the worst case, it is bounded
by $n$, the bit length of the dump. Thus $\mathrm{iv}(a, R)(i)$ can be
computed in time $O((n - i)|R| + (n - i)|R| \log |R|) =
O((n - i)|R| \log |R|)$.

If $\mathrm{iv}(a, R)(i)$ is computed for all $i \in [0, n)$, the sorting
complexity for $i > 0$ can be lowered by taking advantage
of the sorted list of dumps with respect to $<_{i-1}$.
We merely need to perform a merge-sort for $<_i$ on two
sets given by the restrictions $s_{i-1} = 0$ and $s_{i-1} = 1$ and
ordered with respect to $<_{i-1}$. This can be performed in
time $O((n - i)|R|)$. By summing up the time it takes to
compute $\mathrm{iv}(a, R)(i)$ for $i \in [0, n)$ we obtain the theorem. $\square$
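
The sort-and-compare idea behind this proof can be sketched in a few lines of Python. The sketch is an illustration under stated assumptions, not the algorithm as implemented in the prototype: dumps are given as equal-length strings over '0' and '1', one representative per attribute value, the function name is chosen here, and the returned end index is an absolute bit position. It also re-sorts for every start index instead of using the incremental merge described above.

```python
def shortest_interval_from(i, dumps):
    """Smallest interval [i, k] on which all dumps differ pairwise.

    Returns (i, k), or None if two dumps are identical from bit i onwards.
    """
    ordered = sorted(s[i:] for s in dumps)
    k = i
    # In a sorted list it suffices to compare adjacent entries: any two
    # entries share at most the prefix shared by the neighbours between them.
    for a, b in zip(ordered, ordered[1:]):
        diff = next((j for j, (x, y) in enumerate(zip(a, b)) if x != y), None)
        if diff is None:
            return None
        k = max(k, i + diff)
    return (i, k)
```

For the dumps 0010, 0001 and 0111, starting at bit 0 the three-bit prefixes 001, 000 and 011 are pairwise distinct while the two-bit prefixes are not, so the interval is [0, 2].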


Abstract


The availability of off-the-shelf exploitation toolkits for 
compromising hosts, coupled with the rapid rate of 
exploit discovery and disclosure, has made exploit or 
vulnerability-based detection far less effective than it 
once was. For instance, the increasing use of metamor- 
phic and polymorphic techniques to deploy code injec- 
tion attacks continues to confound signature-based de- 
tection techniques. The key to detecting these attacks 
lies in the ability to discover the presence of the injected 
code (or, shellcode). One promising technique for do- 
ing so is to examine data (be that from network streams 
or buffers of a process) and efficiently execute its con- 
tent to find what lurks within. Unfortunately, current ap- 
proaches for achieving this goal are not robust to eva- 
sion or scalable, primarily because of their reliance on 
software-based CPU emulators. In this paper, we argue
that the use of software-based emulation techniques
is not necessary, and instead propose a new framework
that leverages hardware virtualization to better enable the 
detection of code injection attacks. We also report on 
our experience using this framework to analyze a corpus 
of malicious Portable Document Format (PDF) files and 
network-based attacks. 


1 Introduction 


In recent years, code-injection attacks have become a 
widely popular modus operandi for performing mali- 
cious actions on network services (e.g., web servers and 
file servers) and client-based programs (e.g., browsers 
and document viewers). These attacks are used to deliver 
and run arbitrary code (coined shellcode) on victims’ 
machines, often enabling unauthorized access and con- 
trol of the machine. In traditional code-injection attacks, 
the code is delivered by the attacker directly, rather than 
already existing within the vulnerable application, as in 
return-to-libc attacks. Depending on the specifics of the
vulnerability that the attacker is targeting, injected code
can take several forms, including source code for an in- 
terpreted scripting-language, intermediate byte-code, or 
natively-executable machine code [17]. 

Typically, though not always, the vulnerabilities ex- 
ploited arise from the failure to properly define and re- 
ject improper input. These failures have been exploited 
by several classes of code-injection techniques, includ- 
ing buffer overflows [24], heap spray attacks [7, 36], and 
return oriented programming (ROP)-based attacks [3]. 
One prominent and contemporary example embodying 
these attacks involves the use of popular, cross-platform 
document formats, such as the Portable Document For- 
mat (PDF), to help compromise systems [37]. 

Malicious PDF files started appearing on the Internet 
a few years ago, and their rise steadily increased around 
the same time that Adobe Systems published their PDF 
format specifications [34]. Irrespective of when they 
first appeared, the reason for their rise in popularity as 
a method for compromising hosts is obvious: PDF is 
supported on all major operating systems, it supports a 
bewildering array of functionality (e.g., JavaScript and
Flash), and some applications (e.g., email clients) render 
them automatically. Moreover, the “stream objects” in 
PDF allow many types of encodings (or “filters” in the 
PDF language) to be used, including multi-level com- 
pression, obfuscation, and even encryption. 

It is not surprising that malware authors quickly re- 
alized that these features can be used for nefarious pur- 
poses. Today, malicious PDFs are distributed via mass 
mailing, targeted email, and drive-by downloads [32]. 
These files carry an infectious payload that may come 
in the form of one or more embedded executables within 
the file itself¹, or contain shellcode that, after successful
exploitation, downloads additional components. 

The key to detecting these attacks lies in accurately 
discovering the presence of the shellcode in network 
payloads (for attacks on network services) or process 
buffers (for client-based program attacks). This, however,
is a significant challenge because of the prevalent
use of metamorphism (i.e., the replacement of a set of
instructions by a functionally-equivalent set of different
instructions) and polymorphism (i.e., a similar technique
that hides a set of instructions by encoding, and later
decoding, them), which allow the shellcode to change its
appearance significantly from one attack to the next.

In this paper, we argue that a promising technique for 
detecting shellcode is to examine the input—be that net- 
work streams or buffers from a process—and efficiently 
execute its content to find what lurks within. While this 
idea is not new, we provide a novel approach based on 
a new kernel, called ShellOS, built specifically to address
the shortcomings of current analysis techniques
that use software-based CPU emulation to achieve the 
same goal (e.g., [6, 8, 13, 25, 26, 43]). Unlike these ap- 
proaches, we take advantage of hardware virtualization 
to allow for far more efficient and accurate inspection of 
buffers by directly executing instruction sequences on the 
CPU. In doing so, we also reduce our exposure to evasive 
attacks that take advantage of discrepancies introduced 
by software emulation. 

The remainder of the paper is organized as follows. 
We first present background information and related 
work in §2. Next, we discuss the challenges facing 
emulation-based approaches in §3. Our framework for
supporting the detection and forensic analysis of code 
injection attacks is presented in §4. We provide a perfor- 
mance evaluation, as well as a case study of real-world 
attacks, in §5. Limitations of our current design are dis- 
cussed in §6. Finally, we conclude in §7. 


2 Background and Related Work 


Early solutions to the problems facing signature-based 
detection systems attempted to find the presence of mali- 
cious code (for example, in network streams) by search- 
ing for tell-tale signs of executable code. For instance, 
Toth and Kruegel [38] applied a form of static analysis, 
coined abstract payload execution, to analyze the exe- 
cution structure of network payloads. While promising, 
Fogla et al. [9] showed that polymorphism defeats this 
detection approach. Moreover, the underlying assump- 
tion that shellcode must conform to discernible structure 
on the wire was shown by several researchers [19, 29, 42] 
to be unfounded. 

Going further, Polychronakis et al. [26] proposed the 
use of dynamic code analysis using emulation techniques 
to uncover shellcode in code injection attacks target- 
ing network services. In their approach, the bytes off 
the wire from a network tap are translated into assem- 
bly instructions, and a simple software-based CPU em- 
ulator employing a read-decode-execute loop is used to 
execute the instruction sequences starting at each byte 
offset in the inspected input. The sequence of instructions
starting from a given offset in the input is called
an execution chain. The key observation is that to be 
successful, the shellcode must execute a valid execution 
chain, whereas instruction sequences from benign data 
are likely to contain invalid instructions, access invalid 
memory addresses, cause general protection faults, etc. 
In addition, valid malicious execution chains will exhibit 
one or more observable behaviors that differentiate them 
from valid benign execution chains. Hence, a network 
stream can be flagged as malicious if there is a single 
execution chain within the inspected input that does not 
cause fatal faults in the emulator before malicious be- 
havior is observed. This general notion of network-level 
emulation has proven to be quite useful, and has garnered 
much attention of late (e.g., [13, 25, 41, 43]). 

Recently, Cova et al. [6] and Egele et al. [8] extended 
this idea to protect web browsers from so-called “heap- 
spray” attacks, where an attacker coerces an application 
to allocate many objects containing malicious code in or- 
der to increase the success rate of an exploit that jumps 
to locations in the heap [36]. These attacks are partic- 
ularly effective in browsers, where an attacker can use 
JavaScript to allocate many malicious objects [4, 35]. 
Heap spraying has been used in several high profile at- 
tacks on major browsers and document readers. Several 
Common Vulnerabilities and Exposures (CVE) disclosures
have been released about these attacks in the wild.
To the best of our knowledge, all the aforementioned ex- 
ploit detection approaches employ software-based CPU 
emulators to detect shellcode in heap objects. 

Finally, we note that although runtime analysis of pay- 
loads using software-based CPU emulation techniques 
has been successful in detecting exploits in the wild [8, 
27], the use of software emulation makes them suscepti- 
ble to multiple methods of evasion [18, 21, 33]. More- 
over, as we show later, software emulation is not scal- 
able. Our objective in this paper is to forgo software- 
based emulation altogether, and explore the design and 
implementation of components necessary for robust de- 
tection of code injection attacks. 


3 Challenges for Software-based CPU 
Emulation Detection Approaches 


As alluded to earlier, prior art in detecting code injec- 
tion attacks has applied a simple read-decode-execute ap- 
proach, whereby data is translated into its corresponding 
instructions, and then emulated in software. Obviously, 
the success of such approaches rests on accurate software 
emulation; however, the instruction set for modern CISC 
architectures is very complex, and so it is unlikely that 
software emulators will ever be bug free [18]. 

As a case in point, the popular and actively developed
QEMU emulator [2], which employs more advanced emulation
techniques based on dynamic binary translation,
does not faithfully emulate the FPU-based Get Program
Counter (GetPC) instructions, such as fnstenv². Consequently,
some of the most commonly used code injection
attacks fail to execute properly, including those
encoded with Metasploit’s popular “shikata ga nai” en- 
coder and three other encoders from its arsenal that rely 
on this GetPC instruction to decode their payload. While 
this may be a boon to QEMU users employing it for full- 
system virtualization (as one rarely requires a fully faithful
fnstenv implementation for normal application usage),
using this software emulator as-is for injected code
detection would be fairly ineffective. In fact, we aban- 
doned our earlier attempts at building a QEMU-based de- 
tection system for exactly this reason. 

To address accurate emulation of machine instructions 
typically used in code injection attacks, special-purpose 
CPU emulators (e.g., nemu [28], libemu [1]) were
developed. Unfortunately, they suffer from a different 
problem: large subsets of instructions rarely used by in- 
jected code are skipped when encountered in the instruc- 
tion stream. The result is that any discrepancy between 
an emulated instruction and the behavior on real hard- 
ware potentially allows shellcode to evade detection by 
altering its behavior once emulation is detected [21, 33]. 
Indeed, the ability to detect emulated environments is already
present in modern exploit toolkits.

Arguably, a more practical limitation of emulation- 
based detection is that of performance. When this ap- 
proach is used in network-level emulation, for example, 
the overhead can be non-trivial since (i) the vast major- 
ity of network streams will contain benign data, some of 
which might be significant in size, (ii) successfully de- 
tecting even non-sophisticated shellcode can require the 
execution of thousands of instructions, and (iii) a sepa- 
rate execution chain must be attempted for each offset in 
a network stream because the starting location of injected 
code is unknown. 

To avoid these obstacles, the current state of practice is 
to limit run-time analysis to the first n bytes (e.g., 64 KB)
of one side of a network stream, to examine flows to 
only known servers or from known services, or to termi- 
nate execution after some threshold of instructions (e.g., 
2048) has been reached [25, 27, 43]. It goes without say- 
ing that imposing such stringent run-time restrictions in- 
evitably leads to the possibility of missing attacks (e.g., 
in the unprocessed portions of streams). 

One might argue that more advanced software-based 
emulation techniques such as dynamic binary transla- 
tion [30] could offer significant performance enhance- 
ments over the simple emulation used in current state-of- 
the-art dynamic shellcode detectors. However, the performance
benefit of dynamic binary translation hinges
on the assumption that code blocks are translated once, 
but executed many times. While this assumption holds 
true with typical application usage, executing random 
streams of data (as in network-level emulation) results in 
short instruction sequences ending in a fault, rather than 
a structured program flow. Furthermore, dynamic binary 
translation still has the problem of emulation accuracy. 
Lastly, it is common for software-based CPU em- 
ulation techniques to omit processing of some exe- 
cution chains as a performance-boosting optimization 
(e.g., only executing instruction sequences that contain a 
GetPC instruction, or skipping an execution chain if the 
starting instruction was already executed during a previ- 
ous execution chain). Unfortunately, such optimizations 
are unsafe, in that they are susceptible to evasion. For in- 
stance, in the former case, metamorphic code may evade 
detection by, for example, pushing data representing a 
GetPC instruction to the stack and then executing it. 


begin snippet

0 exit:
1   in  al, 0x7         ; Chain 1
2   mov eax, 0xFF       ; Chain 2 begins
3   mov ebx, 0x30       ; Chain 2
4   cmp eax, 0xFF       ; Chain 2
5   je  exit            ; Chain 2 ends
6   mov eax, fs:[ebx]   ; Chain 3 begins

end snippet


Figure 1: Sample instruction sequence 


In the latter case, consider the sequence shown in Figure 1.
The first execution chain ends after a single privileged
instruction. The second execution chain executes
instructions 2 to 5 before ending due to a conditional 
jump to a privileged instruction. Now, since instructions 
3, 4, and 5 were already executed in the second execu- 
tion chain they are skipped (as a beginning offset) as a 
performance optimization. The third execution chain be- 
gins at instruction 6 with an access to the Thread Envi- 
ronment Block (TEB) data structure to the offset speci- 
fied by ebx. Had the execution chain beginning at in- 
struction 3 not been skipped, ebx would be loaded with 
0x30. Instead, ebx is now loaded with a random value 
set by the emulator at the beginning of each execution 
chain. Thus, if detecting an access to the memory loca- 
tion at fs: [0x30] is critical to detecting injected code, 
the attack will be missed. 
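
The effect can be reproduced with a toy model of the sequence in Figure 1. The Python sketch below is purely illustrative and is not how any of the cited emulators work: the instruction encoding, the function names, and the fixed stand-in for the emulator's random initial register value are all assumptions made for this example.

```python
# Toy encoding of the instructions in Figure 1 (keys are instruction numbers).
PROGRAM = {
    1: ("in",),               # in al, 0x7: privileged, ends the chain
    2: ("mov", "eax", 0xFF),
    3: ("mov", "ebx", 0x30),
    4: ("cmp", "eax", 0xFF),
    5: ("je", 1),             # je exit (label `exit` is instruction 1)
    6: ("load_fs", "ebx"),    # mov eax, fs:[ebx]
}
INITIAL = 0x1234  # assumed stand-in for the "random" initial register value


def run_chain(start):
    """Execute one chain; return the instructions run and the fs: offsets read."""
    regs = {"eax": INITIAL, "ebx": INITIAL}
    zero, executed, fs_reads, pc = False, [], [], start
    while pc in PROGRAM:
        op = PROGRAM[pc]
        executed.append(pc)
        if op[0] == "in":
            break                           # fault on privileged instruction
        if op[0] == "mov":
            regs[op[1]] = op[2]
        elif op[0] == "cmp":
            zero = regs[op[1]] == op[2]
        elif op[0] == "je" and zero:
            pc = op[1]
            continue
        elif op[0] == "load_fs":
            fs_reads.append(regs[op[1]])
        pc += 1
    return executed, fs_reads


def detects_teb_access(skip_executed):
    """Try a chain from every offset; optionally skip offsets already executed."""
    seen = set()
    for start in sorted(PROGRAM):
        if skip_executed and start in seen:
            continue
        executed, fs_reads = run_chain(start)
        seen.update(executed)
        if 0x30 in fs_reads:
            return True
    return False
```

Without the optimization, the chain starting at instruction 3 loads ebx with 0x30 and reaches instruction 6, so the access to fs:[0x30] is observed. With the optimization, instructions 3 to 5 are never used as starting offsets and the chain starting at instruction 6 reads from the arbitrary initial value instead.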


4 Our Approach: SHELLOS 


Unlike prior approaches, we take advantage of the ob- 
servation that the most widely used heuristics for shell- 
code detection exploit the fact that, to be successful, the 
injected shellcode typically needs to read from memory 
(e.g., from addresses where the payload has been mapped
in memory, or from addresses in the Process Environ- 
ment Block (PEB)), write the payload to some memory 
area (especially in the case of polymorphic shellcode), 
or transfer flow to newly created code [16, 22, 23, 25–28,
41, 43]. For instance, the execution of shellcode often
results in the resolution of shared libraries (DLLs)
through the PEB. Rather than tracing each instruction 
and checking whether its memory operands can be clas- 
sified as “PEB reads,” we allow instruction sequences to 
execute directly on the CPU using hardware virtualiza- 
tion, and only trace specific memory reads, writes, and 
executions through hardware-supported paging mecha- 
nisms. 

Our design for enabling hardware-supported detection of code
injection attacks is built upon a virtualization solution [12]
known as Kernel-based Virtual Machine (KVM). We use 
the KVM hypervisor to abstract Intel VT and AMD-V 
hardware virtualization support. At a high level, the 
KVM hypervisor is composed of a privileged domain 
and a virtual machine monitor (VMM). The privileged 
domain is used to provide device support to unprivileged 
guests. The VMM, on the other hand, manages the phys- 
ical CPU and memory and provides the guest with a vir- 
tualized view of the system resources. 

In a hardware virtualized platform, the VMM only 
mediates processor events (e.g., via instructions such 
as VMEntry and VMExit on the Intel platform) that 
would cause a change in the entire system state, such as 
physical device IO, modifying CPU control registers, etc. 
Therefore, it no longer emulates guest instruction execu- 
tions as with software-based CPU emulation; execution 
happens directly on the processor, without an interme- 
diary instruction translation. We take advantage of this 
design to build a new kernel, called ShellOS, that runs
as a guest OS using KVM and whose sole task is to de- 
tect and analyze code injection attacks. The high-level 
architecture is depicted in Figure 2. 





4.1 The SHELLOS Interface 


ShellOS can be viewed as a black box, wherein a buffer
is supplied to ShellOS by the privileged domain for inspection
via an API call. ShellOS performs the analysis
and reports (1) if injected code was found, (2) the
location in the buffer where the shellcode was found, and 
(3) a log of the actions performed by the shellcode. 

A library within the privileged domain provides the 
ShellOS API call, which handles the sequence of actions
required to initialize guest mode via the KVM
ioctl interface. One notable feature of initializing 
guest mode in KVM is the assignment of guest phys- 
ical memory from a userspace-allocated buffer. We 
use this feature to satisfy a critical requirement — that 
is, efficiently moving buffers into ShellOS for analysis.
Since offset zero of the userspace-allocated memory
region corresponds to the guest physical address of
0x0, we can reserve a fixed memory range within the 
guest address space where the privileged domain library 
writes the buffers to be analyzed. These buffers are then 
directly accessible to the ShellOS guest at the predefined
physical address.

The privileged domain library also optionally allows 
the user to specify a process snapshot for ShellOS to
use as the default environment. The details about this
snapshot are given later in §4.5, but for now it is sufficient
to note that the intention is to allow the user to
analyze buffers in an environment as similar as possible 
to what the injected code would expect. For example, 
a user analyzing buffers extracted from a PDF process 
may provide an Acrobat Reader snapshot, while one an- 
alyzing Flash objects might supply an Internet Explorer 
snapshot. While malicious code detection may typically 
occur without this extra data, it provides a realistic envi- 
ronment for our post facto diagnostics. 

When the privileged domain first initializes
ShellOS, the kernel completes its boot sequence (detailed
next) and issues a VMExit. When the ShellOS API
is called to analyze a buffer, the buffer is copied to the
fixed shared region before a VMEnter is issued. ShellOS
completes its analysis and writes the result to the shared
region before issuing another VMExit, signaling that
the kernel is ready for another buffer. Finally, we build
a thread pool into the library wherein each buffer to be
analyzed is added to a work queue and one of n workers
dequeues the job and analyzes the buffer in a unique
instance of ShellOS.
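
The work-queue arrangement can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's code: analyze_in_instance is a hypothetical stand-in for one VMEnter/VMExit round trip, and run_pool and its parameters are names introduced here.

```python
import queue
import threading


def analyze_in_instance(instance_id: int, buf: bytes) -> dict:
    """Hypothetical stand-in for one VMEnter/VMExit round trip."""
    return {"instance": instance_id, "length": len(buf)}


def run_pool(buffers, n_workers=4):
    """Analyze every buffer using n_workers workers, one instance each."""
    jobs = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker(instance_id):
        while True:
            buf = jobs.get()
            if buf is None:          # sentinel: no more work
                return
            report = analyze_in_instance(instance_id, buf)
            with lock:
                results.append(report)

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(n_workers)]
    for t in threads:
        t.start()
    for buf in buffers:
        jobs.put(buf)
    for _ in threads:
        jobs.put(None)               # one sentinel per worker
    for t in threads:
        t.join()
    return results
```

Each worker owns its instance for its whole lifetime, so no state is shared between concurrent analyses other than the queue and the result list.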








4.2 The SHELLOS Kernel 


To set up our execution environment, we initialize the
Global Descriptor Table (GDT) to mimic a Windows
environment. More specifically, code and data entries are
added for user and kernel modes using a flat 4GB memory
model, a Task State Segment (TSS) entry is added
that denies all usermode IO access, and a special entry
that maps to the virtual address of the Thread
Environment Block (TEB) is added. We set the auxiliary
FS segment register to select the TEB entry, as done by
the Windows kernel. Therefore, regardless of where the
TEB is mapped into memory, code (whether benign or
malicious) can always access the data structure at FS:[0].
This “feature” is commonly used by injected code to find
shared library locations, and indeed, access to this region
of memory has been used as a heuristic for identifying
injected code [28].

Virtual memory is implemented with paging, and mirrors
that of a Windows process. Virtual addresses above
3GB are reserved for the ShellOS kernel. The kernel
supports loading arbitrary snapshots created using
the minidump format [20] (e.g., used in tools such as
WinDBG). The minidump structure contains the necessary
information to recreate the state of the running process
at the time the snapshot was taken. Once all regions
in the snapshot have been mapped, we adjust the TEB entry
in the Global Descriptor Table to point to the actual
TEB location in the snapshot.

Figure 2: Architecture for detecting code injection attacks. The ShellOS platform includes the ShellOS operating
system and host-side interface for providing buffers and extending ShellOS with custom memory snapshots and
run-time detection heuristics. As shown, buffers are analyzed from reassembled TCP connections collected on a network
tap; however, ShellOS may be used as a component in any framework that requires analysis of injected code.


Control Loop Recall that ShellOS’ primary goal is
to enable fast and accurate detection of input containing
shellcode. To do so, we must support the ability to
execute the instruction sequences starting at every offset
in the inspected input. Execution from each offset
is required since the first instruction of the shellcode is
unknown. The control loop in ShellOS is responsible
for this task. Once ShellOS is signaled to begin
analysis, the FPU, MMX, XMM, and general purpose
registers are randomized to thwart injection attacks that try
to hinder analysis by guessing fixed register values (set
by ShellOS) and ending execution early upon detection
of these conditions. The program counter is set to the
address of the buffer being analyzed. Buffer execution
begins when ShellOS transitions to usermode with the
iret instruction. At this point, instructions are executed
directly on the CPU in usermode until execution is
interrupted by a fault, trap, or timeout. The control loop is
therefore completely interrupt driven.
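
The overall shape of the control loop can be modeled in a few lines. This is a simplified sketch, not the kernel's implementation: direct execution is replaced by a hypothetical callback run_chain, the outcome constants are names introduced here, only a subset of registers is modeled, and randomizing registers before every chain is an assumption.

```python
import random

FAULT, TIMEOUT, DETECTED = "fault", "timeout", "detected"

# Only general purpose registers are modeled in this sketch.
REGISTERS = ("eax", "ebx", "ecx", "edx", "esi", "edi", "ebp")


def control_loop(buf: bytes, run_chain, rng=None):
    """Try an execution chain from every offset of buf.

    run_chain(buf, offset, regs) is a hypothetical callback standing in
    for direct execution in usermode; it returns FAULT, TIMEOUT, or
    DETECTED. Returns the first detecting offset, or None.
    """
    rng = rng or random.Random()
    for offset in range(len(buf)):
        # Fresh random register state (assumed here to be per chain), so
        # injected code cannot rely on fixed values set by the kernel.
        regs = {name: rng.getrandbits(32) for name in REGISTERS}
        outcome = run_chain(buf, offset, regs)
        if outcome == DETECTED:
            return offset
        # FAULT or TIMEOUT: state is reset and the next offset is tried.
    return None
```

In the real system the loop is driven by interrupts rather than return values; the callback's result plays the role of the fault, trap, or timeout that ends a chain.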

We define a fault as an unrecoverable error in the
instruction stream, such as attempting to execute a
privileged instruction (e.g., the in al, 0x7 instruction in
Figure 2), or encountering an invalid opcode. The kernel
is notified of a fault through one of 32 interrupt vectors
indicating a processor exception. The Interrupt Descriptor
Table (IDT) points all fault-generating interrupts to a
generic assembly-level routine that resets usermode state
before attempting the next execution chain.

We define a trap, on the other hand, as a recoverable
exception in the instruction stream (e.g., a page fault
resulting from a needed, but not yet paged-in, virtual
address), and once handled appropriately, the instruction
stream continues execution. Traps provide an opportunity
to coarsely trace some actions of the executing code,
such as reading an entry in the TEB. To deal with
instruction sequences that result in infinite loops, we
currently use a rudimentary approach wherein ShellOS
instructs the programmable interval timer (PIT) to
generate an interrupt at a fixed frequency. When this timer
fires twice in the current execution chain (guaranteeing
at least 1 tick interval of execution time), the chain is
aborted. Since the PIT is not directly accessible in guest
mode, KVM emulates the PIT timer via privileged domain
timer events implemented with hrtimer, which
in turn uses the High Precision Event Timer (HPET)
device as the underlying hardware timer. This level of
indirection imposes an unavoidable performance penalty
because external interrupts (e.g., ticks from a timer) cause a
VMExit.

Furthermore, the guest must signal that each inter- 
rupt has been handled via an End-of-Interrupt (EOI). The 
problem here is that EOI is implemented as a physical de- 
vice IO instruction which requires a second VMExit for 
each tick. The obvious trade-off is that while a higher 
frequency timer would allow us to exit infinite loops 
quickly, it also increases the overhead associated with en- 
tering and exiting guest mode (due to the increased num- 
ber of VMExits). To alleviate some of this overhead, we 
place the KVM-emulated PIT in what is known as Auto- 
EOI mode. This mode allows new timeout interrupts to 
be received without requiring a device IO instruction to 
acknowledge the previous interrupt. In this way, we ef- 
fectively cut the overhead in half. We return later to a 
discussion on setting appropriate timer frequencies, and 
its implications for run-time performance. 
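
The trade-off described above reduces to simple arithmetic, sketched below. The function names are introduced here for illustration, and the model counts only timer-related VMExits; any other source of exits is ignored.

```python
def vmexits_per_second(timer_hz: int, auto_eoi: bool) -> int:
    """Timer-related VMExits per second.

    Each tick causes one VMExit; without Auto-EOI mode the EOI write
    causes a second VMExit for the same tick.
    """
    exits_per_tick = 1 if auto_eoi else 2
    return timer_hz * exits_per_tick


def chain_time_bounds(timer_hz: int):
    """Bounds (in seconds) on how long a chain runs before a timeout.

    A chain is aborted on the second tick it observes, so it runs for
    at least one tick interval and at most two.
    """
    tick = 1.0 / timer_hz
    return tick, 2 * tick
```

For example, at 1000 Hz the model gives 2000 timer-related VMExits per second without Auto-EOI and 1000 with it, which is the halving mentioned in the text.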

The complete ShellOS kernel is composed of 2471
custom lines of C and assembly code.











4.3 Detection 


The ShellOS kernel provides an efficient means to
execute arbitrary buffers of code or data, but we also need a
mechanism for determining if these execution sequences
represent injected code. One of our primary contributions
in this paper is the ability to modularly use existing
runtime heuristics in an efficient and accurate framework
that does not require tracing every machine-level
instruction, or performing unsafe optimizations. A key
insight towards this goal is the observation that existing
reliable detection heuristics really do not require
fine-grained instruction-level tracing; rather, coarsely tracing
memory accesses to specific locations is sufficient.

Towards this goal, a handful of approaches are readily
available for efficiently tracing memory accesses; e.g.,
using hardware supported debug registers, or exploring
virtual memory based techniques. Hardware debug
registers are limited in that only a few memory locations
may be traced at one time. Our approach, based on
virtual memory, is similar in implementation to stealth
breakpoints [40] and allows for an unlimited number
of memory traps to be set to support multiple runtime
heuristics defined by an analyst.

Recall that an instruction stream will be interrupted 
with a trap upon accessing a memory location that gen- 
erates a page fault. We may therefore force a trap to oc- 
cur on access to an arbitrary virtual address by clearing 
the present bit of the page entry mapping for that ad- 
dress. For each address that requires tracing we clear the 
corresponding present bit and set the OS reserved 
field to indicate that the kernel should trace accesses 
to this entry. When a page fault occurs, the interrupt 
descriptor table (IDT) directs execution to an interrupt 
handler that checks these fields. If the OS reserved 
field indicates tracing is not requested, then the page 
fault is handled according to the region mappings
defined in the process’ snapshot. Regardless of where the
analyzed buffers originate from (e.g., a network packet
or a heap object) a Windows process snapshot is always
loaded in ShellOS in order to populate OS data
structures (e.g., the TEB), and to load data commonly present
(e.g., shared libraries) when injected code executes.

When a page entry does indicate that tracing should 
occur, and the faulting address (accessible via the CR2 
register) is in a list of desired address traps (provided, for 
example, by an analyst), the page fault must be logged 
and appropriately handled. In handling a page fault re- 
sulting from a trap, we must first allow the page to be 
accessed by the usermode code, then reset the trap im- 
mediately to ensure trapping future accesses to that page. 
To achieve this, the handler sets the present bit in the 
page entry (enabling access to the page) and the TRAP 
bit in the flags register, then returns to the usermode 
instruction stream. As a result, the instruction that origi- 
nally caused the page fault is now successfully executed 
before the TRAP bit forces an interrupt. The IDT then 
forwards the interrupt to another handler that unsets the 
TRAP and present bits so that the next access to that 
location can be traced. Our approach allows for tracing 
of any virtual address access (read, write, execute), with- 
out a predefined limit on the number of addresses to trap. 
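
The fault-then-single-step sequence can be modeled as a small state machine. This is a sketch of the mechanism only: PageEntry, Tracer, and their fields are names introduced here, the 4096-byte page size is an assumption, and so is the choice to re-arm a traced page even when the faulting address is not in the analyst's list.

```python
class PageEntry:
    def __init__(self, traced=False):
        self.present = not traced    # traced pages start not-present
        self.traced = traced         # models the OS reserved field


class Tracer:
    def __init__(self, page_table, watched, page_size=4096):
        self.page_table = page_table     # page number -> PageEntry
        self.watched = set(watched)      # addresses to be logged
        self.page_size = page_size       # assumed page size
        self.trap_flag = False           # models the TRAP bit in flags
        self.pending = None              # entry to re-arm after the step
        self.log = []

    def access(self, addr, kind):
        """Model one usermode access of the given kind to addr."""
        entry = self.page_table[addr // self.page_size]
        if not entry.present:
            self.on_page_fault(addr, kind, entry)
        # The instruction now completes; with the TRAP bit set, a
        # single-step interrupt follows immediately.
        if self.trap_flag:
            self.on_single_step()

    def on_page_fault(self, addr, kind, entry):
        if not entry.traced:
            entry.present = True         # ordinary fault: map the page
            return
        if addr in self.watched:
            self.log.append((kind, addr))
        entry.present = True             # let the access succeed
        self.trap_flag = True            # interrupt after one instruction
        self.pending = entry

    def on_single_step(self):
        self.pending.present = False     # re-arm the trap
        self.trap_flag = False
        self.pending = None
```

Because the page is present for exactly one instruction, every later access to it faults again and can be logged.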


Detection Heuristics ShellOS, by design, is not tied
to any specific set of behavioral heuristics. Any heuristic
based on memory reads, writes, or executions can
be supported with coarse-grained tracing. To highlight
the strengths of ShellOS, we chose to implement the
PEB heuristic proposed by Polychronakis et al. [28].
That particular heuristic was chosen for its simplicity,
as well as the fact that it has already been shown to be
successful in detecting a wide array of Windows
shellcode. This heuristic detects injected code that parses
the process-level TEB and PEB data structures in order
to locate the base address of shared libraries loaded in
memory. The TEB contains a pointer to the PEB (address
FS:[0x30]), which contains a pointer to yet another
data structure (i.e., LDR_DATA) containing several
linked lists of shared library information.

The detection approach given in [28] checks if 
accesses are being made to the PEB pointer, the 
LDR_DATA pointer, and any of the linked lists. To im- 
plement their detection approach, we simply set a trap on 
each of these addresses and report that injected code has 
been found when the necessary conditions are met. This 
heuristic fails to detect certain cases, but we reiterate that 
any number of other heuristics could be chosen instead. 
We leave this as future work. 
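
A trap-based version of this check might look like the sketch below. It is illustrative only: the structure offsets other than FS:[0x30] are assumptions about the 32-bit Windows layout, the function names are introduced here, and requiring a read in all three groups is one possible reading of "the necessary conditions".

```python
PEB_PTR_OFFSET = 0x30               # FS:[0x30], as stated in the text
LDR_PTR_OFFSET = 0x0C               # assumed offset of the LDR_DATA pointer
LIST_OFFSETS = (0x0C, 0x14, 0x1C)   # assumed offsets of the module lists


def peb_trap_addresses(teb_base, peb_base, ldr_base):
    """Addresses to trap, grouped by the structure they belong to."""
    return {
        "peb_ptr": {teb_base + PEB_PTR_OFFSET},
        "ldr_ptr": {peb_base + LDR_PTR_OFFSET},
        "lists": {ldr_base + off for off in LIST_OFFSETS},
    }


def peb_heuristic(read_addresses, traps):
    """True when every group of traps saw at least one read.

    The order of the reads is not checked in this sketch.
    """
    seen = set(read_addresses)
    return all(seen & group for group in traps.values())
```

In practice the base addresses would be taken from the loaded process snapshot rather than passed in by hand.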


4.4 Diagnostics 


Although efficient and reliable identification of code in- 
jection attacks is an important contribution of this paper, 
the forensic analysis of the higher-level actions of these 
attacks is also of significant value to security
professionals. To this end, we provide a method for reporting
forensic information about a buffer where shellcode has been
detected. Again, we take advantage of the memory snapshot
facility introduced earlier (and detailed in §4.5) to obtain a list of
virtual addresses associated with API calls for various
shared libraries. We place traps on these addresses, and 
when triggered, a handler for the corresponding call is 
invoked. That handler pops function parameters off the 
usermode stack, logs the call and its supplied parameters, 
performs actions needed for the successful completion of 
that call (e.g., allocating heap space), and then returns to 
the injected code. 
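
The handler logic can be sketched as follows. This is an illustration under assumptions, not the ShellOS handlers: the stdcall convention (return address on top of the stack, arguments beneath it, callee cleans up) is assumed, ApiTracer is a name introduced here, and the per-call side effects mentioned above (e.g., allocating heap space) are omitted.

```python
class ApiTracer:
    def __init__(self, hooks):
        # hooks: trap address -> (API name, number of stack arguments).
        # In the paper these addresses come from the process snapshot;
        # any entries supplied by a caller of this sketch are made up.
        self.hooks = hooks
        self.log = []

    def on_trap(self, addr, stack):
        """Handle an execute trap at addr.

        stack is a list whose last element is the top of the usermode
        stack. Returns the address at which execution resumes.
        """
        name, nargs = self.hooks[addr]
        ret_addr = stack.pop()                       # pushed by the call
        args = [stack.pop() for _ in range(nargs)]   # first argument first
        self.log.append((name, tuple(args)))
        return ret_addr
```

A call such as on_trap(hook_addr, stack) leaves the stack as the caller would expect after a stdcall return and records the call with its arguments in log.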

Obviously, due to the myriad of API calls available, 
one cannot expect the diagnostics to be complete. Keep 
in mind, however, that the lack of completeness in our 
diagnostics facility is independent of the actual detection 
of injected code. The ability to extend the level of diag- 
nostic information is straightforward, but tedious. That 
said, as shown later, we are able to provide a wealth of 
diagnostic information on a diverse collection of self- 
contained [27] shellcode injection attacks. 


4.5 Extensibility 


The capabilities provided by ShellOS are but one
component in an overall framework necessary to detect code
injection attacks. This larger framework should support
the loading of custom process snapshots and arbitrary
shellcode detection heuristics, each defined by a list of
read, write, or execute memory traps. Since ShellOS
only detects and diagnoses the buffers of data provided,
there must be some mechanism for providing buffers of
data we suspect contain injected code. To this end, we
built two platforms that rely on ShellOS to scan buffers
for injected code: one to detect client-based program
attacks such as the malicious PDFs discussed earlier, and
another, operating as a network intrusion detection
system, to detect attacks on network services.


Supporting Detection of Code Injection in Client-based
Programs: To showcase ShellOS’ promise as
a platform upon which other modules can be built, we
implemented a lightweight memory monitoring facility
that allows ShellOS to scan buffers created by
documents loaded in the process space of a prescribed reader
application. In this context, a document is any file or
object that may be opened with its corresponding
program, such as a PDF, Microsoft Word document, Flash
object, HTML page, etc. This platform may be useful to
an enterprise as a network service wherein documents are
automatically sent for analysis (e.g., by extraction from
network streams or an email server) or manually
submitted by an analyst in a forensic investigation.

The approach we take to detect shellcode in malicious
documents is to let the reader application handle
rendering of the content while monitoring any buffers
created by it, and signaling ShellOS to scan these buffers
for shellcode (using existing heuristics). This approach
has several advantages. An important one is that we do
not need to worry about recreating any document object
model, handling obfuscated JavaScript, or dealing with
all the other idiosyncrasies that pose challenges for other
approaches [6, 8, 39]. We simply need to analyze the
buffers created when rendering the document in a
quarantined environment. The challenge lies in doing all of
this as efficiently as possible.

To support this goal, we provide a monitoring facility
that is able to snapshot the memory contents of
processes. The snapshots are constructed in a manner that
captures the entire process state, the virtual memory
layout, as well as all the code and data pages within the
process. The data pages contain the buffers allocated on the
heap, while the code pages contain all the system
modules that must be loaded by ShellOS to enable analysis.
Our memory tracing facility comprises fewer than 900
lines of custom C/C++ code. A high-level view of the
approach is shown in Figure 3.

This functionality was built specifically for the
Windows OS and can support any application running on
Windows. The memory snapshots are created using
custom software that attaches to an arbitrary application
process and stores contents of memory using the
functionality provided by Windows’ debug library (DbgHelp).
We capture buffers that are allocated on the heap (i.e.,
pages mapped as RW), as well as thread and module
information. The results are stored in minidump format,
which contains all the information required to recreate
the process within ShellOS, including all DLLs, the
PEB/TEB, register state, the heap and stack, and the
virtual memory layout of these components.

Figure 3: A platform for analyzing process buffers using
ShellOS





Supporting Detection of Code Injection in Network
Services: Another use-case for ShellOS is detecting
code injection attacks targeting network services. While
the shellcode embedded in client-based program code
injection attacks is typically obfuscated in multiple
layers of encoding (e.g., compressed form → JavaScript →
shellcode), attacks on network services are often present
directly as executable shellcode on the wire. As noted
by Polychronakis et al. [26], we may use this
observation to build a platform to detect code injection attacks
on network services by reassembling observed network
streams and executing each of these streams. This
platform may be used in an enterprise as a component of
a network intrusion detection system or for post-facto
analysis of a network capture in a forensic investigation.


5 Evaluation 


In the analysis that follows, we first examine ShellOS’
ability to faithfully execute network payloads and
successfully trigger the detection heuristics when shellcode
is found. Next, we examine the performance benefits of
the ShellOS framework when compared to software
emulation. We also report on our experience using
ShellOS to analyze a collection of suspicious PDF
documents. All experiments were conducted on an Intel
Xeon Quad Processor machine with 32 GB of memory.
The host OS was Ubuntu with kernel version 2.6.35.

Table 1: Off-the-Shelf Shellcode Detection.


5.1 Performance 


To evaluate our performance, we used Metasploit to
launch attacks in a virtualized environment. For each
encoder, we generated hundreds of attack instances by
randomly selecting 1 of 7 exploits and 1 of 9 self-contained
payloads that utilize the PEB for shared library
resolution, and by randomly generating parameter values
associated with each type of payload (e.g., download URL, bind
port, etc.). As the attacks launched, we captured the
network traffic for later network-level buffer analysis.

We also encoded several payload instances using
an advanced polymorphic engine, called TAPiON.
TAPiON incorporates features designed to thwart
emulation. Each of the encoders we used (see Table 1) is
considered to be self-contained [25] in that it does not
require additional contextual information about the process
it is injected into in order to function properly.
Indeed, we do not specifically address non-self-contained
shellcode in this paper.

For the sake of comparison, we chose a software-based
solution (called Nemu [28]) that is reflective of the
current state of the art. Nemu and ShellOS both performed
well in detecting all the instances of the code injection
attacks developed using Metasploit, with a few exceptions.

Surprisingly, Nemu failed to detect shellcode
generated using the alpha_upper encoder. Since the
encoder payload relies on accessing the PEB for shared
library resolution, we expected both Nemu and ShellOS
to trigger this detection heuristic. We speculate that
Nemu is unable to handle this particular case because
of inaccurate emulation of its particular instruction
sequences, underscoring the need to directly execute
the shellcode on hardware.

More pertinent to the discussion is that while the
software-based emulation approach is capable of
detecting shellcode generated with the TAPiON engine,
performance optimization limits its ability to do so.
The TAPiON engine attempts to confound detection
by basing its decoding routines on timing components
(namely, the RDTSC instruction) and uses a plethora of
CPU-intensive coprocessor instructions in long loops to
slow runtime analysis. These long loops quickly reach
Nemu’s default execution threshold (2048) prior to any
heuristic being triggered. This is particularly
problematic because no GetPC instruction is executed until these
loops complete.

Furthermore, software-based emulators simply treat
the majority of coprocessor instructions as NOPs. While
TAPiON does not currently use the result of these
instructions in its decoding routine, it only takes minor
changes to the out-of-the-box engine to incorporate these
results and thwart detection (hence the “*” in Table 1).
ShellOS, on the other hand, fully supports all
coprocessor instructions with its direct CPU execution.

More problematic for these classes of approaches is
that successfully detecting code encoded by engines such
as TAPiON can require following very long execution
chains (e.g., well over 60,000 instructions). To examine
the runtime performance of our prototype, we randomly
generated 1000 benign inputs, and set the instruction
thresholds (in both approaches) to the levels required to
detect instances of TAPiON shellcode.

Figure 4: ShellOS Performance


Since ShellOS currently cannot directly set an
instruction threshold (due to the coarse-grained tracing
approach), we approximate the required threshold by
adjusting the execution chain timeout frequency. As the
timer frequency increases, the number of instructions
executed per execution chain decreases. Thus, we
experimentally determined the maximum frequency needed
to execute the TAPiON shellcodes that required 10k,
16k, and 60k instruction executions to complete their
loops. These timer frequencies are 5000 Hz, 4000 Hz,
and 1000 Hz, respectively. Note that in the common
case, ShellOS can execute many more instructions,
depending on the speed of individual instructions. TAPiON
code, however, is specifically designed to use the slower
FPU-based instructions. (ShellOS can execute over 4
million fast NOP instructions in the same time interval
that only 60k FPU-heavy instructions are executed.)
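
To make the relationship between timer frequency and execution time concrete, the reported settings can be converted to tick intervals. The sketch below uses only the three (frequency, instruction count) pairs stated in the text; the names REPORTED, tick_interval_us, and budget_table are introduced here for illustration.

```python
# (timer frequency in Hz, instructions the TAPiON sample required),
# as reported in the text above.
REPORTED = [(5000, 10_000), (4000, 16_000), (1000, 60_000)]


def tick_interval_us(timer_hz: int) -> float:
    """Length of one timer tick in microseconds."""
    return 1_000_000 / timer_hz


def budget_table(settings=REPORTED):
    """Rows of (instructions, timer_hz, tick interval in microseconds)."""
    return [(n, hz, tick_interval_us(hz)) for hz, n in settings]
```

Lower frequencies give each chain a longer guaranteed run time, which is why the sample needing 60k instructions corresponds to the slowest of the three timers.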

The results are shown in Figure 4. The labeled
points on the lineplot indicate the minimum execution
chain length required to detect the three representative
TAPiON samples. For completeness, we show the
performance of Nemu with and without unsafe execution
chain pruning (see §3). When unsafe pruning is used,
software emulation does better than ShellOS on a
single core at very low execution thresholds. This is not
too surprising, as the higher clock frequencies required to
support short execution chains in ShellOS incur
additional overhead (see §4). However, with longer execution
chains, the real benefit of ShellOS becomes apparent:
ShellOS (on a single core) is an order of magnitude
faster than Nemu when unsafe execution chain pruning
is disabled. Finally, we observe that the worker queue
provided by the ShellOS host-side library efficiently
multi-processes buffer analysis, and demonstrates that
multi-processing offers a viable alternative to the unsafe
elimination of execution chains.


A note on 64-bit architectures The performance of 
ShellOS is even more compelling when one takes 
into consideration the fact that in 64-bit architectures, 
program counter relative addressing is allowed—hence, 
there is no need for shellcode to use any form of “Get 
Program Counter” code to locate its address on the stack; 
a limitation that has been widely used to detect tradi- 
tional 32-bit shellcode using (very) low execution thresh- 
olds. This means that as 64-bit architectures become 
commonplace, shellcode detection approaches using dy- 
namic analysis must resort to heuristics that require the 
shellcode to fully decode. The implications are that 
the requirement to process long execution chains, such 
as those already exhibited by today’s advanced engines 
(e.g., Hydra [29] and TAPiON), will be of far more sig- 
nificance than it is today. 


5.2 Throughput 


To better study our throughput on network streams, 
we built a testbed consisting of 32 machines running 
FreeBSD 6.0 and generated traffic using a state-of-the- 
art traffic generator, Tmix [15]. The network traffic is 
routed between the machines using Linux-based soft- 
ware routers. The link between the two routers is tapped 
using a gigabit fiber tap, with the traffic diverted to our 
detection appliance (i.e., running ShellOS or Nemu),
as well as to a network monitor that constantly monitors
the network for throughput and losses. The experimental
setup is shown in Figure 5.

Figure 5: Experimental testbed with end systems generating traffic using Tmix. Using a network tap, we monitor the
throughput on one system, while ShellOS or Nemu attempt to analyze all traffic on another system.


Tmix synthetically regenerates TCP traffic that 
matches the statistical properties of traffic observed in 
a given network trace; this includes source level prop- 
erties such as file and object size distributions, number 
of simultaneously active connections and also network 
level properties such as round trip time. Tmix also pro- 
vides a block resampling algorithm to achieve a target 
throughput while preserving the statistical properties of 
the original network trace. 

We supply Tmix with a network trace of HTTP
connections captured on the border links of UNC-Chapel
Hill in October 2009. The trace represents 1 hour of
activity, which is more than long enough to capture
distributions for many statistical measures indistinguishable
from longer traces [14]. Using Tmix block resampling,
we run two 1-hour experiments based on the original
trace where Tmix attempts to maintain a throughput of
100Mbps in the first experiment and 350Mbps in the
second experiment. The actual throughput fluctuates
somewhat as Tmix maintains statistical properties observed in the
original network trace. We repeat each experiment with
the same seed (to generate the same traffic) using both
Nemu and ShellOS.

Both ShellOS and Nemu are configured to only
analyze traffic from the connection initiator, as we are
targeting code injection attacks on network services. We
analyze up to one megabyte of a network connection (from
the initiator) and set an execution threshold of 60k
instructions (see §5.1). Neither ShellOS nor Nemu
performs any instruction chain pruning (i.e., we try
execution from every position in every buffer), and each uses
only a single CPU core.

Figure 6 shows the results of the network
experiments. The bottom subplot shows the traffic throughput
generated over the course of both 1-hour experiments.
The 100Mbps experiment actually fluctuates from 100 to
160Mbps, while the 350Mbps experiment nearly reaches
500Mbps at some points. The top subplot depicts the
number of buffers analyzed over time for both ShellOS
and Nemu with both experiments. Note that one buffer
is analyzed for each connection containing data from the
connection initiator. The plot shows that the maximum
number of buffers per second for Nemu hovers around
75 for both the 100Mbps and 350Mbps experiments, with
significant packet loss observed in the middle subplot.
ShellOS is able to process around 250 buffers per
second in the 100Mbps experiment with zero packet loss and
around 750 buffers per second in the 350Mbps
experiment with intermittent packet loss. That is, ShellOS
is able to process all buffers with 1 CPU core, without
loss, on a network with sustained 100Mbps network
throughput, and is on the cusp of its maximum
throughput on 1 CPU core on a network with sustained
350Mbps network throughput (and spikes up to
500Mbps). In these tests, we received no false positives
for either ShellOS or Nemu.


Figure 6: ShellOS network throughput performance.

Our experimental network setup, unfortunately, is not
currently able to generate sustained throughput greater
than the 350Mbps experiment. Therefore, to demonstrate
ShellOS’ scalability in leveraging multiple CPU cores,
we instead turn to an analysis of the libnids packet
queue size in the 350Mbps experiment. We fix the
maximum packet queue size at 100k, then run the 350Mbps
experiment 4 times utilizing 1, 2, 4, and 14 cores. When
the packet queue size reaches the maximum, packet loss
occurs. The average queue size should be as low as
possible to minimize the chance of packet loss due to
sudden spikes in network traffic, as observed in the middle
subplot of Figure 6 for the 350Mbps ShellOS
experiment. Figure 7 shows the CDF of the average packet
queue size over the course of each 1-hour experiment run
with a different number of CPU cores. The figure shows
that using 2 cores reduces the average queue size by an
order of magnitude, 4 cores reduces average queue size
to less than 10 packets, and 14 cores is clearly more than
sufficient for 350Mbps sustained network traffic. This
evidence suggests that multi-core ShellOS may be
capable of monitoring links with much greater throughput
than we were able to generate in our experiments.

Figure 7: CDF of the average packet queue size as the
number of ShellOS CPU cores is scaled.


5.3 Case Study: PDF Code Injection


We now report on our experience using this framework to
analyze a collection of 427 malicious PDFs. These PDFs
were randomly selected from a larger subset of
suspicious files flagged by a large-scale web malware
detection system. Each PDF is labeled with a Common
Vulnerabilities and Exposures (CVE) number (or “Unknown” tag).
Of these files, 22 were corrupted, leaving us with a total
of 405 files for analysis. We also use a collection of 179
benign PDFs from various USENIX conferences.

We launch each document with Adobe Reader and
attach the memory facility to that process. We then
snapshot the heap as the document is rendered, and wait
until the heap buffers stop growing. 374 of the 405
malicious PDFs resulted in a unique set of buffers. ShellOS
is then signaled that the buffers are ready for
inspection. Note that we only generate the process layout once
per application (e.g., Reader), and subsequent snapshots
only contain the heap buffers.

Figure 8 shows the size distribution of heap buffers
extracted from benign and malicious PDFs. Notice that
≈ 60% of the buffers extracted from malicious PDFs are
512K long. This striking feature can be attributed to
the heap allocation strategy used by the Windows OS,
whereby chunks of 512K and higher are memory aligned
at 64K boundaries. As noted by Ding et al. [7],
attackers can take advantage of this alignment to increase the
success rate of their attacks (e.g., by providing a more
predictable landing spot for the shellcode when used in
conjunction with large NOP-sleds).

Figure 8: CDF of sizes of the extracted buffers








CVE              Detected
CVE-2007-5659           2
CVE-2008-2992          10
CVE-2009-4324          12
CVE-2009-2994           1
CVE-2009-0927          33
CVE-2010-0188          53
CVE-2010-2883          70
Unknown               144














Table 2: CVE Distribution for Detected Attacks 


Table 2 provides a breakdown of the corresponding
CVE listings for the 325 unique code injection attacks we
detected. Interestingly, we were able to detect 70 attacks
using Return Oriented Programming (ROP) because of
their second-stage exploit (CVE-2010-2883) triggering
the PEB heuristic. We verified these attacks used ROP
through subsequent manual analysis of the JavaScript
included in the PDFs and reiterate that our current runtime
heuristics do not directly detect ROP code, but that in
all the examples we observed using ROP, control was
always transferred to non-ROP shellcode to perform the
primary actions of the attack. We believe that in the
future the flexibility of ShellOS’ ability to load arbitrary
process snapshots may be leveraged to correctly execute,
detect, and diagnose ROP by iterating the stack pointer
(instead of the IP) over a buffer and issuing a ret
instruction to test every position of a buffer for ROP. This
may be critical as attackers become more adept at
crafting ROP-only code injection attacks.

Figure 9 depicts the CDF of the time taken to extract heap
objects from malicious and benign documents. The time
distribution for malicious documents is further broken down
by “ROP-based” (i.e., CVE-2010-2883) and other exploits.
The group labeled “other” performed more traditional
heap-spray attacks with self-contained shellcode,
and is not particularly interesting (at least, from a
forensic standpoint). In either case, we were able to extract
approximately 98% of the buffers within 26 seconds. For
the benign files, extraction took less than 5 seconds for
98% of the documents. The processing time in the
benign case is low because the buffers are allocated just once
when the PDF is rendered on open, as opposed to the
hundreds of heap objects created by the embedded JavaScript
that performs the heap-sprays.

Figure 9: Elapsed time for extracting heap objects

The overall time for performing our analyses is given
in Figure 10. Notice that the majority of the time
can be attributed to buffer extraction. Once signaled,
ShellOS analyzes the buffers at high speed. The average
time to analyze a benign PDF (the common case,
hopefully) is 5.46 seconds with our unoptimized code.

Figure 10: Breakdown of average time of analysis.

We remind the reader that the framework we provide
is not tied to any particular method of buffer extraction.
To the contrary, ShellOS executes any arbitrary buffer
supplied by the analyst and reports if the desired
heuristics are triggered. In this case study, we simply chose to
highlight the usefulness of ShellOS with buffers
provided by our own PDF pre-processor.

Next, we describe some of the patterns we observed 
lurking within PDF-based code injection attacks. 


5.4 Forensic Analysis 


Recall that once injected code is detected, ShellOS
continues to allow execution to collect diagnostic traces
of Windows API calls before returning a result. In the
majority of cases, the diagnostics completed successfully
for the PDF dataset. Of the diagnostics performed in the
“other” category, we found that 85% of the injected code
exhibited an identical API call sequence:


begin snippet

LoadLibraryA("urlmon")
URLDownloadToCacheFile(
    URL = "http://(omitted).cz.cc/out.php?a=36&p=5",
    CacheFile = "%tmp%")
CreateProcessA(App = "%tmp%", Cmd = (null))
TerminateThread(Thread = -2, ExitCode = 0)

end snippet


The top level domains were always cz.cc and 
the GET request parameters varied only in numerical 
value. We also observed that all of the remaining 
PDFs in the other category (where diagnostics suc- 
ceeded) used either the URLDownloadToCacheFile 
or URLDownloadToFile API call to download a file, 
then executed it with CreateProcessA, WinExec, 
or ShellExecuteA. Two of these shellcodes at- 
tempted to download several binaries from the same do- 
main, and a few of the requested URLs contained obvi- 
ous text-based information pertinent to the exploit used, 
e.g. exp=PDF (Collab), exp=PDF (GetIcon), 
or ex=Util.Printf — presumably for bookkeeping 
in an overall diverse attack campaign. 
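
The download-then-execute pattern described above lends itself to a simple check over an API call trace. The sketch below is illustrative only: it assumes a trace is available as an ordered list of API names (a format chosen for the example, not the framework's actual output) and uses just the call names mentioned in the text.

```python
DOWNLOAD_CALLS = {"URLDownloadToCacheFile", "URLDownloadToFile"}
EXECUTE_CALLS = {"CreateProcessA", "WinExec", "ShellExecuteA"}


def is_download_and_execute(trace):
    """True if a download call is later followed by an execute call."""
    downloaded = False
    for call in trace:
        if call in DOWNLOAD_CALLS:
            downloaded = True
        elif call in EXECUTE_CALLS and downloaded:
            return True
    return False


# The most common sequence observed in the "other" category.
common = ["LoadLibraryA", "URLDownloadToCacheFile",
          "CreateProcessA", "TerminateThread"]
matches = is_download_and_execute(common)
```

A check of this kind only groups traces by their overall shape; it says nothing about the URL or the downloaded file.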

Two of the self-contained payloads were only partially
analyzed by the diagnostics, and proved to be quite
interesting. The partial call trace for the first of these is given
in Figure 11. Here, the injected code allocates space on
the heap, then copies code into that heap area. Although
the code copy is not apparent in the API call sequence
alone, ShellOS may also provide an instruction-level
trace (when requested by the analyst) by single-stepping
each instruction via the TRAP bit in the flags register. We
observed the assembly-level copies using this feature.
The code then proceeds to patch several DLL functions,
partially observed in this trace by the use of API calls to
modify page permissions prior to patching, then resetting
them after patching. Again, the assembly-level patching
code is only observable in a full instruction trace. Finally,
the shellcode performs the conventional URL download
and executes that download.


begin snippet

GlobalAlloc(Flags = 0x0, Bytes = 8192)
VirtualProtect(Addr = 0x7c86304a, Size = 4096,
    Protect = 0x40)
VirtualProtect(Addr = 0x7c86304a, Size = 4096,
    Protect = 0x20)
LoadLibraryA("user32")
VirtualProtect(Addr = 0x77d702d3, Size = 4096,
    Protect = 0x40)
VirtualProtect(Addr = 0x77d702d3, Size = 4096,
    Protect = 0x20)
LoadLibraryA("ntdll")
VirtualProtect(Addr = 0x7c918c2e, Size = 4096,
    Protect = 0x40)
VirtualProtect(Addr = 0x7c918c2e, Size = 4096,
    Protect = 0x20)
LoadLibraryA("urlmon")
URLDownloadToCacheFile(
    URL = "http://www.(omitted).net/file.exe",
    CacheFile = "%tmp%")
CreateProcessA(App = (null), Cmd = "cmd /c %tmp%")

end snippet

Figure 11: More complex shellcode in a PDF
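
The permission toggling visible in Figure 11 can be picked out of a trace mechanically. The sketch below pairs each VirtualProtect call that sets a page to 0x40 (PAGE_EXECUTE_READWRITE) with a later call that restores the same address to 0x20 (PAGE_EXECUTE_READ). The trace representation, a list of (name, arguments) tuples, is an assumption made for the example; the addresses are those shown in the figure.

```python
PAGE_EXECUTE_READ = 0x20
PAGE_EXECUTE_READWRITE = 0x40


def patched_pages(trace):
    """Addresses made writable and executable, then restored to execute/read."""
    opened = set()
    patched = []
    for name, args in trace:
        if name != "VirtualProtect":
            continue
        addr, protect = args["Addr"], args["Protect"]
        if protect == PAGE_EXECUTE_READWRITE:
            opened.add(addr)          # page opened for patching
        elif protect == PAGE_EXECUTE_READ and addr in opened:
            opened.discard(addr)      # permissions reset after patching
            patched.append(addr)
    return patched


trace = [
    ("GlobalAlloc", {"Flags": 0x0, "Bytes": 8192}),
    ("VirtualProtect", {"Addr": 0x7C86304A, "Size": 4096, "Protect": 0x40}),
    ("VirtualProtect", {"Addr": 0x7C86304A, "Size": 4096, "Protect": 0x20}),
    ("LoadLibraryA", {"Name": "user32"}),
    ("VirtualProtect", {"Addr": 0x77D702D3, "Size": 4096, "Protect": 0x40}),
    ("VirtualProtect", {"Addr": 0x77D702D3, "Size": 4096, "Protect": 0x20}),
    ("LoadLibraryA", {"Name": "ntdll"}),
    ("VirtualProtect", {"Addr": 0x7C918C2E, "Size": 4096, "Protect": 0x40}),
    ("VirtualProtect", {"Addr": 0x7C918C2E, "Size": 4096, "Protect": 0x20}),
]
pages = patched_pages(trace)
```

Such pairs only indicate that a page was probably modified; as noted above, the patching itself is visible only in an instruction-level trace.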


The second interesting case challenges our prototype 
diagnostics by applying some anti-analysis techniques. 
The partial API call sequence observed follows: 


begin snippet

GetFileSize(hFile = 0x4)
GetTickCount()
GlobalAlloc(Flags = 0x40, Bytes = 4) = buf*
ReadFile(hFile = 0x0, Buf* = buf*, Len = 4)
...continues to loop in this sequence...

end snippet

Figure 12: Analysis-resistant Shellcode


As ShellOS does not currently address context-sensitive
code, we have no way of providing the file size
expected by this code. Furthermore, we do not provide
the required timing characteristics for this particular
sequence, as our API call handlers merely attempt to
provide a ‘correct’ value, with minimal behind-the-scenes
processing. As a result, this sequence of API calls is
repeated in an infinite loop, preventing further automated
analysis. We note, however, that this particular challenge
is not unique to ShellOS.
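
A looping diagnostic trace such as the one in Figure 12 can at least be recognized automatically so that analysis stops early. The following sketch looks for the shortest call sequence that repeats back-to-back at the end of a trace. It is an illustrative heuristic, not part of the framework; the threshold of three repetitions and the assumption that all four calls recur in each iteration are ours.

```python
def repeating_cycle(trace, min_repeats=3):
    """Shortest call sequence repeated min_repeats times at the end of trace."""
    n = len(trace)
    for period in range(1, n // min_repeats + 1):
        tail = trace[n - period * min_repeats:]
        cycle = tail[:period]
        # The tail must consist of the candidate cycle repeated verbatim.
        if all(tail[i] == cycle[i % period] for i in range(len(tail))):
            return cycle
    return None


loop = ["GetFileSize", "GetTickCount", "GlobalAlloc", "ReadFile"]
looping_trace = ["LoadLibraryA"] + loop * 3
found = repeating_cycle(looping_trace)
```

Detecting the cycle does not defeat the anti-analysis check itself; it only lets the analyst know that the trace is no longer making progress.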

Of the 70 detected ROP-based exploit PDFs, 87% of
the second-stage payloads adhered to the following API
call sequence:

begin snippet

LoadLibraryA("urlmon")
LoadLibraryA("Shell32")
GetTempPathA(Len = 64, Buffer = "C:\TEMP\")
URLDownloadToFile(
    URL = "http://(omitted).php?
        spl=pdf_sing&s=0907...(omitted)...FC2_1
        &fh=",
    File = "C:\TEMP\a.exe")
ShellExecuteA(File = "C:\TEMP\a.exe")
ExitProcess(ExitCode = -2)

end snippet

Figure 13: Typical second stage of a ROP-based PDF
code injection attack observed using ShellOS.

Of the remaining payloads, 6 use an API not yet
supported in ShellOS, while the others are simple variants
on this conventional URL download pattern.


6 Limitations 


Code injection attack detection based on run-time
analysis, whether emulated or supported through direct CPU
execution, generally operates as a self-sufficient black box
wherein a suspicious buffer of code or data is supplied,
and a result returned. ShellOS attempts to provide
a run-time environment as similar as possible to that
which the injected code expects. That said, we cannot
ignore the fact that shellcode designed to execute under
very specific conditions may not operate as expected
(e.g., non-self-contained [19, 26], context-keyed [11],
and swarm attacks [5]). We note, however, that by requiring
more specific processor state, the attack exposure is
reduced, which usually runs counter to the attacker's goal
of exploiting as many systems as possible. The same
rationale holds for the use of ROP-based attacks, which
require specific data being present in memory.

More specific to our framework is that we cur- 
rently employ a simplistic approach for loop detection. 
Whereas software-based emulators are able to quickly 
detect and (safely) exit an infinite loop by inspecting pro- 
gram state at each instruction, we only have the opportu- 
nity to inspect state at each clock tick. At present, the 
overhead associated with increasing timer frequency to 
inspect program state more often limits our ability to exit 
from infinite loops more quickly. In future work, we plan 
to explore alternative methods for safely pruning such 
loops, without incurring excessive overhead. 

Furthermore, while employing hardware virtualization
to run ShellOS provides increased transparency over
previous approaches, it may still be possible to detect a
virtualized environment through the small set of
instructions that must still be emulated. We note, however, that
while ShellOS currently uses hardware virtualization
extensions to run alongside a standard host OS, only the
implementation of device drivers prevents ShellOS from
running directly as the host OS. Running directly as the
host OS could have additional performance benefits in
detecting code injection for network services. We leave
this for future work.

Finally, ShellOS provides a framework for fast
detection and analysis of a buffer, but an analyst or
automated data pre-processor (such as that presented in §5)
must provide these buffers. As our own experience has
shown, doing so can be non-trivial, as special care
must be taken to ensure that a realistic operating environment
is provided to elicit the proper execution of the sample
under inspection. This same challenge holds for all VM
or emulation-based detection approaches we are aware
of (e.g., [6, 8, 10, 31]). Our framework can be extended
to benefit from the active body of research in this area.


7 Conclusion 


In this paper, we propose a new framework for en- 
abling fast and accurate detection of code injection at- 
tacks. Specifically, we take advantage of hardware virtu- 
alization to allow for efficient and accurate inspection of 
buffers by directly executing instruction sequences on the 
CPU. Our approach allows for the modular use of exist- 
ing run-time heuristics in a manner that does not require 
tracing every machine-level instruction, or performing 
unsafe optimizations. In doing so, we provide a foun- 
dation that defenses for code injection attacks can build 
upon. We also provide an empirical evaluation, spanning 
real-world attacks, that aptly demonstrates the strengths 
of our framework. 


Code Availability 


We anticipate that the source code for the ShellOS kernel
and our packaged tools will be made available under
a BSD license for research and non-commercial uses.
Please contact the first author for more information on
obtaining the software.

Notes

1 See, for example, “Sophisticated, targeted malicious PDF
documents exploiting CVE-2009-4324” at
http://isc.sans.edu/diary.html?storyid=7867.

2 See the discussion at
https://bugs.launchpad.net/qemu/+bug/661696, November, 2010.

3 We reset registers via popa and fxrstor instructions, while
memory is reset by traversing page table entries and reloading pages
with the dirty bit set.

4 The TAPiON engine is available at
http://pb.specialised.info/all/tapion/

5 We update this network trace with payload byte distributions
collected in 2011.

References 


[1] P. Baecher and M. Koetter. Libemu - x86 shellcode
emulation library. Available at
http://libemu.carnivore.it/, 2007.

[2] F. Bellard. Qemu, a fast and portable dynamic
translator. In Proceedings of the USENIX Annual
Technical Conference, pages 41-41, Berkeley, CA,
USA, 2005.

[3] E. Buchanan, R. Roemer, H. Shacham, and S. Savage.
When Good Instructions Go Bad: Generalizing
Return-Oriented Programming to RISC. In
ACM Conference on Computer and Communications
Security, Oct. 2008.

[4] C. Curtsinger, B. Livshits, B. Zorn, and
C. Seifert. Zozzle: Fast and Precise In-Browser
Javascript Malware Detection. USENIX Security
Symposium, August 2011.

[5] S. P. Chung and A. K. Mok. Swarm attacks against
network-level emulation/analysis. In International
symposium on Recent Advances in Intrusion Detection,
pages 175-190, 2008.

[6] M. Cova, C. Kruegel, and G. Vigna. Detection
and analysis of drive-by-download attacks and
malicious javascript code. In International conference
on World Wide Web, pages 281-290, 2010.

[7] Y. Ding, T. Wei, T. Wang, Z. Liang, and W. Zou.
Heap Taichi: Exploiting Memory Allocation Granularity
in Heap-Spraying Attacks. In Annual
Computer Security Applications Conference, pages
327-336, 2010.

[8] M. Egele, P. Wurzinger, C. Kruegel, and E. Kirda.
Defending browsers against drive-by downloads:
Mitigating heap-spraying code injection attacks. In
Detection of Intrusions and Malware & Vulnerability
Assessment, June 2009.


[9] P. Fogla, M. Sharif, R. Perdisci, O. Kolesnikov, and
W. Lee. Polymorphic blending attacks. In USENIX
Security Symposium, pages 241-256, 2006.

[10] S. Ford, M. Cova, C. Kruegel, and G. Vigna.
Analyzing and detecting malicious flash advertisements.
In Computer Security Applications Conference,
pages 363-372, Dec 2009.

[11] D. A. Glynos. Context-keyed Payload Encoding:
Fighting the Next Generation of IDS. In Athens IT
Security Conference (ATH.CON), 2010.

[12] R. Goldberg. Survey of Virtual Machine Research.
IEEE Computer Magazine, 7(6):34-35, 1974.

[13] B. Gu, X. Bai, Z. Yang, A. C. Champion, and
D. Xuan. Malicious shellcode detection with virtual
memory snapshots. In International Conference
on Computer Communications (INFOCOM),
pages 974-982, 2010.

[14] F. Hernandez-Campos, F. Smith, and K. Jeffay.
Tracking the evolution of web traffic: 1995-2003.
In Proceedings of the 11th IEEE/ACM International
Symposium on Modeling, Analysis and Simulation
of Computer Telecommunication Systems
(MASCOTS), pages 16-25, 2003.

[15] F. Hernandez-Campos, K. Jeffay, and F. Smith.
Modeling and generating TCP application workloads.
In 4th IEEE International Conference on
Broadband Communications, Networks and Systems
(BROADNETS), pages 280-289, 2007.

[16] I. Kim, K. Kang, Y. Choi, D. Kim, J. Oh, and
K. Han. A Practical Approach for Detecting Executable
Codes in Network Traffic. In Asia-Pacific
Network Ops. & Mngt Symposium, 2007.

[17] G. MacManus and M. Sutton. Punk Ode: Hiding
Shellcode in Plain Sight. In Black Hat USA, 2006.

[18] L. Martignoni, R. Paleari, G. F. Roglia, and
D. Bruschi. Testing CPU Emulators. In International
Symposium on Software Testing and Analysis,
pages 261-272, 2009.

[19] J. Mason, S. Small, F. Monrose, and G. MacManus.
English shellcode. In Conference on Computer and
Communications Security, pages 524-533, 2009.


[20] MSDN. Minidump header structures. MSDN
Library. See http://msdn.microsoft.com/en-us/library/ms680378(VS.85).aspx.

[21] R. Paleari, L. Martignoni, G. F. Roglia, and
D. Bruschi. A Fistful of Red-Pills: How to Automatically
Generate Procedures to Detect CPU Emulators.
In USENIX Workshop on Offensive Technologies, 2009.

[22] A. Pasupulati, J. Coit, K. Levitt, S. F. Wu, S. H. Li,
R. C. Kuo, and K. P. Fan. Buttercup: on Network-based
Detection of Polymorphic Buffer Overflow
Vulnerabilities. In IEEE/IFIP Network Op. & Mngt
Symposium, pages 235-248, May 2004.

[23] U. Payer, P. Teufl, and M. Lamberger. Hybrid Engine
for Polymorphic Shellcode Detection. In Detection
of Intrusions and Malware & Vulnerability
Assessment, pages 19-31, 2005.

[24] J. D. Pincus and B. Baker. Beyond stack Smashing:
Recent Advances in Exploiting Buffer Overruns.
IEEE Security and Privacy, 4(2):20-27, 2004.

[25] M. Polychronakis, K. G. Anagnostakis, and E. P.
Markatos. Network-level Polymorphic Shellcode
Detection using Emulation. In Detection of Intrusions
and Malware & Vulnerability Assessment,
pages 54-73, 2006.

[26] M. Polychronakis, K. G. Anagnostakis, and E. P.
Markatos. Emulation-based Detection of Non-self-contained
Polymorphic Shellcode. In International
Symposium on Recent Advances in Intrusion Detection, 2007.

[27] M. Polychronakis, K. G. Anagnostakis, and E. P.
Markatos. An Empirical Study of Real-world Polymorphic
Code Injection Attacks. In USENIX
Workshop on Large-Scale Exploits and Emergent
Threats, 2009.

[28] M. Polychronakis, K. G. Anagnostakis, and E. P.
Markatos. Comprehensive shellcode detection using
runtime heuristics. In Annual Computer Security
Applications Conference, pages 287-296, 2010.

[29] P. V. Prabhu, Y. Song, and S. J. Stolfo. Smashing
the Stack with Hydra: The Many Heads of Advanced
Polymorphic Shellcode, 2009. Presented at
Defcon 17, Las Vegas.

[30] M. Probst. Fast machine-adaptable dynamic binary
translation. In Proceedings of the Workshop on Binary
Translation, 2001.

[31] N. Provos, D. McNamee, P. Mavrommatis,
K. Wang, and N. Modadugu. The ghost in the
browser: Analysis of web-based malware. In
Usenix Workshop on Hot Topics in Botnets, 2007.

[32] N. Provos, P. Mavrommatis, M. A. Rajab, and
F. Monrose. All Your iFRAMEs Point to Us. In
USENIX Security Symposium, pages 1-15, 2008.

[33] T. Raffetseder, C. Kruegel, and E. Kirda. Detecting
System Emulators. Information Security, 4779:1-18, 2007.

[34] M. A. Rahman. Getting Owned by malicious PDF -
analysis. SANS Institute, InfoSec Reading Room,
2010.

[35] P. Ratanaworabhan, B. Livshits, and B. Zorn. NOZZLE:
A Defense Against Heap-spraying Code Injection
Attacks. In USENIX Security Symposium,
pages 169-186, 2009.

[36] A. Sotirov and M. Dowd. Bypassing Browser
Memory Protections. In Black Hat USA, 2008.

[37] D. Stevens. Malicious PDF documents. Information
Systems Security Association (ISSA) Journal,
July 2010.

[38] T. Toth and C. Kruegel. Accurate Buffer Overflow
Detection via Abstract Payload Execution. In International
Symposium on Recent Advances in Intrusion
Detection, pages 274-291, 2002.

[39] Z. Tzermias, G. Sykiotakis, M. Polychronakis, and
E. P. Markatos. Combining static and dynamic
analysis for the detection of malicious documents.
In Proceedings of the Fourth European Workshop
on System Security, pages 4:1-4:6, New York, NY,
USA, 2011.

[40] A. Vasudevan and R. Yerraballi. Stealth breakpoints.
In 21st Annual Computer Security Applications
Conference, pages 381-392, 2005.

[41] X. Wang, Y.-C. Jhi, S. Zhu, and P. Liu. STILL:
Exploit Code Detection via Static Taint and Initialization
Analyses. Annual Computer Security Applications
Conference, pages 289-298, Dec 2008.

[42] Y. Younan, P. Philippaerts, F. Piessens, W. Joosen,
S. Lachmund, and T. Walter. Filter-resistant code
injection on ARM. In ACM Conference on Computer
and Communications Security, pages 11-20,
2009.

[43] Q. Zhang, D. S. Reeves, P. Ning, and S. P. Iyer.
Analyzing Network Traffic to Detect Self-Decrypting
Exploit Code. In ACM Symposium on Information,
Computer and Communications Security, 2007.


MACE: Model-inference-Assisted Concolic Exploration 
for Protocol and Vulnerability Discovery 


Chia Yuan Cho†‡ Domagoj Babić† Pongsin Poosankam†§
Kevin Zhijie Chen† Edward XueJun Wu† Dawn Song†
†University of California, Berkeley §Carnegie Mellon University ‡DSO National Labs


Abstract 


Program state-space exploration is central to software se- 
curity, testing, and verification. In this paper, we propose 
a novel technique for state-space exploration of software 
that maintains an ongoing interaction with its environ- 
ment. Our technique uses a combination of symbolic and 
concrete execution to build an abstract model of the ana- 
lyzed application, in the form of a finite-state automaton, 
and uses the model to guide further state-space explo- 
ration. Through exploration, MACE further refines the 
abstract model. Using the abstract model as a scaffold, 
our technique wields more control over the search pro- 
cess. In particular: (1) shifting search to different parts of 
the search-space becomes easier, resulting in higher code 
coverage, and (2) the search is less likely to get stuck in 
small local state-subspaces (e.g., loops) irrelevant to the 
application’s interaction with the environment. Prelim- 
inary experimental results show significant increases in 
the code coverage and exploration depth. Further, our 
approach found a number of new deep vulnerabilities. 


1 Introduction 


Designing secure systems is an exceptionally hard problem.
Even a single bug in an inopportune place can create
catastrophic security gaps. Considering the size of modern
software systems, often reaching tens of millions of
lines of code, exterminating all the bugs is a daunting
task. Thus, innovation and development of new tools
and techniques that help close security gaps is of critical
importance. In this paper, we propose a new technique
for exploring the program's state-space. The technique
explores the program execution space automatically
by combining exploration with learning of an abstract
model of the program's state-space. More precisely,
it alternates (1) a combination of concrete and symbolic
execution [22] to explore the program's state-space, and
(2) the L* [1] online learning algorithm to construct
high-level models of the state-space. Such abstract models, in
turn, guide further search. In contrast, the prior
state-space exploration techniques treat the program as a flat
search-space, without distinguishing states that correspond
to important input processing events.

§This work was done while Pongsin Poosankam was a visiting stu-


A combination of concrete execution and symbolic 
reasoning, known as DART, concolic (concrete and 
symbolic) execution, and dynamic symbolic execution 
[17, 25, 8, 7], exploits the strengths of both. The con- 
crete execution creates a path, followed by symbolic ex- 
ecution, which computes a symbolic logical formula rep- 
resenting the branch conditions along the path. Manipu- 
lation of the formula, e.g., negation of a particular branch 
predicate, produces a new symbolic formula, which is 
then solved with a decision procedure. If a solution ex- 
ists, the solution represents an input to the concrete exe- 
cution, which takes the search along a different path. The 
process is repeated iteratively until the user reaches the 
desired goal (e.g., number of bugs found, code coverage, 
etc.). 
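
The iterative process just described can be illustrated with a toy example. The sketch below is a deliberate simplification and not the implementation used in this work: branch predicates are recorded as Python callables during a concrete run rather than derived symbolically, and a brute-force search over a small input domain stands in for the decision procedure. The program under test, the function names, and the input domain are all hypothetical.

```python
from itertools import product


def program(x, y, trace):
    """Toy program under test; each branch records (predicate, outcome)."""
    def branch(pred):
        outcome = pred(x, y)
        trace.append((pred, outcome))
        return outcome

    if branch(lambda x, y: x > 5):
        if branch(lambda x, y: x + y == 12):
            return "bug"
        return "deep"
    return "shallow"


def solve(constraints, domain=range(16)):
    """Brute-force stand-in for a decision procedure."""
    for x, y in product(domain, repeat=2):
        if all(pred(x, y) == want for pred, want in constraints):
            return (x, y)
    return None


def explore(seed):
    """Run concretely, negate each branch in turn, and solve for new inputs."""
    worklist, seen_paths, results = [seed], set(), {}
    while worklist:
        inputs = worklist.pop()
        trace = []
        results[inputs] = program(*inputs, trace)
        path = tuple(outcome for _, outcome in trace)
        if path in seen_paths:
            continue
        seen_paths.add(path)
        for i, (pred, outcome) in enumerate(trace):
            # Keep the path prefix, flip branch i, and ask for matching inputs.
            new_inputs = solve(trace[:i] + [(pred, not outcome)])
            if new_inputs is not None and new_inputs not in results:
                worklist.append(new_inputs)
    return results


results = explore((0, 0))
```

Starting from the seed (0, 0), the search reaches all three paths of the toy program, including the one guarded by the equality check that random inputs would rarely satisfy.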


We identified two ways to improve this iterative process.
First, dynamic symbolic execution has no high-level
information about the structure of the overall program
state-space. Thus, it has no way of knowing how
close (or how far) it is from reaching important states
in the program and is likely to get stuck in local
state-subspaces, such as loops. Second, unlike decision
procedures that learn search-space pruning lemmas from each
iteration (e.g., [30]), dynamic symbolic execution only
tracks the most promising path prefix for the next
iteration [17]. It does not learn, in the sense that information
gathered in one iteration is not used either to prune the
search-space or to get to interesting states faster in later
iterations.

These two insights led us to develop an approach,
Model-inference-Assisted Concolic (concrete and
symbolic) Exploration (MACE), that learns from each
iteration and constructs a finite-state model of the
search-space. We primarily target applications that maintain
an ongoing interaction with their environment, like servers
and web services, for which a finite-state model is
frequently a suitable abstraction of the communication
protocol, as implemented by the application. At the same
time, we both learn the protocol model and exploit the
model to guide the search.

MACE relies upon dynamic symbolic execution to 
discover the protocol messages, uses a special filtering 
component to select messages over which the model 
is learned, and guides further search with the learned 
model, refining it as it discovers new messages. Those 
three components alternate until the process converges, 
automatically inferring the protocol state machine and 
exploring the program’s state-space. 

We have implemented our approach and applied it to
four server applications (two SMB and two RFB
implementations). MACE significantly improved the line
coverage of the analyzed applications and, more
importantly, discovered four new vulnerabilities and three
known ones. One of the discovered vulnerabilities
received Gnome's “Blocker” severity, the highest severity
in their ranking system, meaning that the next release
cannot be shipped without a fix. Our work makes the
following contributions:


• Although dynamic symbolic execution and decision
procedures perform very similar tasks, the
state-of-the-art decision procedures feature many
techniques, like learning, that have yet to find their way
into dynamic symbolic execution. While in decision
procedures, learned information can be conveniently
represented in the same format as the solved
formula, e.g., in the form of CNF clauses in SAT
solvers, it is less clear how one would learn or
represent the knowledge accumulated during the
dynamic symbolic execution search process. We propose
that for applications that interact with their
environment through a protocol, one could use
finite-state machines to represent learned information and
use them to guide the search.

• As the search progresses, it discovers new information
that can be used to refine the model. We
show one possible way to keep refining the model
by closing the loop: search incrementally refines
the model, while the model guides further search.

• At the same time, MACE both infers a model of
the protocol, as implemented by a program, and
explores the program's search space, automatically
generating tests. Thus, our work contributes both to
the area of automated reverse-engineering of protocols
and automated program testing.

• MACE discovered seven vulnerabilities (four of
which are new) in four applications that we analyzed.
Furthermore, we show that MACE performs
deeper state-space exploration than the baseline
dynamic symbolic execution approach.


2 Related Work 


Model-guided testing has a long history. The hard- 
ware testing community has developed modeling lan- 
guages, like System Verilog, that allow verification teams 
to specify input constraints that are solved with a deci- 
sion procedure to generate random inputs. Such inputs 
are randomized, but adhere to the specified constraints 
and therefore tend to reach much deeper into the tested 
system than purely random tests. Constraint-guided ran- 
dom test generation is nowadays the staple of hardware 
testing. The software community developed its own lan- 
guages, like Spec# [3], for describing abstract software 
models. Such models can be used effectively as con- 
straints for generating tests [27], but have to be written 
manually, which is both time consuming and requires a 
high level of expertise. 

Grammar inference (e.g., [16]) promises automatic in- 
ference of models, and has been an active area of re- 
search in security, especially applied to protocol infer- 
ence. Comparetti et al. [12] infer incomplete (possibly 
missing transitions) protocol state machines from mes- 
sages collected by observing network traffic. To reduce 
the number of messages, they cluster messages according 
to how similar the messages are and how similar their ef- 
fects are on the execution. Comparetti et al. show how 
the inferred protocol models can be used for fuzzing. 
Our work shares similar goals, but features a few im- 
portant differences. First, MACE iteratively refines the 
model using dynamic symbolic execution [18, 25, 9, 7] 
for the state-space exploration. Second, rather than fil- 
tering out individual messages through clustering of in- 
dividual messages, we look at the entire sequences. If 
there is a path in the current state machine that produces 
the same output sequence, we discard the corresponding 
input sequence. Otherwise, we add all the input mes- 
sages to the set used for inferring the state machine in 
the next iteration. Third, rather than using the inferred
model for fuzzing, we use the inferred model to initialize
state-space exploration to a desired state, and then run
dynamic symbolic execution from the initialized state.
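
The sequence-level filtering described above can be sketched as follows. The example is illustrative only: the state machine is represented as a dictionary from states to (input, output, next state) transitions, and the message and state names are hypothetical placeholders rather than messages from any analyzed protocol.

```python
def produces(machine, start, outputs):
    """True if some path from start emits exactly this output sequence."""
    states = {start}
    for out in outputs:
        states = {nxt
                  for s in states
                  for (_inp, o, nxt) in machine.get(s, [])
                  if o == out}
        if not states:
            return False
    return True


def filter_sequences(machine, start, observed, alphabet):
    """Keep the inputs of sequences whose outputs the model cannot produce."""
    for inputs, outputs in observed:
        if not produces(machine, start, outputs):
            alphabet.update(inputs)  # new behavior: keep all its messages
    return alphabet


# Hypothetical current model: state -> [(input, output, next_state), ...]
machine = {
    "s0": [("negotiate", "ok", "s1")],
    "s1": [("login", "ok", "s2"), ("login", "denied", "s1")],
}
observed = [
    (["negotiate", "login"], ["ok", "ok"]),            # already explained
    (["negotiate", "tree_connect"], ["ok", "error"]),  # new output sequence
]
alphabet = filter_sequences(machine, "s0", observed, {"negotiate", "login"})
```

Only the second sequence produces an output sequence that the current model cannot generate, so only its input messages are added to the set used in the next inference round.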


In our prior work [10], we proposed an alternative pro- 
tocol state machine inference approach. There we as- 
sume the end users would provide abstraction functions 
that abstract concrete input and output messages into 
an abstract alphabet, over which we infer the protocol. 
Designing such abstraction functions is sometimes non- 
trivial and requires multiple iterations, especially for pro- 
prietary protocols, for which specifications are not avail- 
able. In this paper, we drop the requirement for user- 
provided input message abstraction, but we do require a 
user-provided output message abstraction function. The 
output abstraction function determines the granularity of 
the inferred abstraction. The right granularity of abstrac- 
tion is important for guiding state-space exploration, be- 
cause too fine-grained abstractions tend to be too expen- 
sive to infer automatically, and too abstract ones fail to 
differentiate interesting protocol states. Furthermore, our 
prior work is a purely black-box approach, while in this 
paper we do code analysis at the binary level in combi- 
nation with grammatical inference. 


In this paper, we analyze implementations of protocols 
for which the source code or specifications are available. 
However, MACE could also be used for inference of 
proprietary protocols and for state-exploration of closed- 
source third-party binaries. In that case, the users would 
need to rely upon the prior research to construct a suit- 
able output abstraction function. The first step in con- 
structing a suitable output abstraction function is under- 
standing the message format. Cui et al. [14, 15] and Ca- 
ballero et al. [6] proposed approaches that could be used 
for that purpose. Further, any automatic protocol infer- 
ence technique has to deal with encryption. In this paper, 
we simply configure the analyzed server applications so 
as to disable encryption, but that might not be an option 
when inferring a proprietary protocol. The work of Ca- 
ballero et al. [5] and Wang et al. [29] addresses automatic 
reverse-engineering of encrypted messages. 


Software model checking tools, like SLAM [2] and 
Blast [20], incrementally build predicate abstractions of 
the analyzed software, but such abstractions are very dif- 
ferent from the models inferred by the protocol inference 
techniques [12, 11]. Such abstractions closely reflect the 
control-flow structure of the software from which they 
were inferred, while our inferred models are more ab- 
stract and tend to have little correlation with the low-level 
program structure. Further, depending on the inference 
approach used, the inferred models can be minimal (like 
in our work), which makes guidance of state-space
exploration techniques more effective.

The Synergy algorithm [19] combines model- 
checking and dynamic symbolic execution to try to cover 
all abstract states of a program. Our work has no ambi- 
tion to produce proofs, and we expect that our approach 
could be used to improve the dynamic symbolic execu- 
tion part of Synergy and other algorithms that use dy- 
namic symbolic execution as a component. 

The Ketchum approach [21] uses random simulation to
drive a hardware circuit into an interesting state
(according to some heuristic), and then performs local
bounded model checking around that state. After reaching
a predefined bound, Ketchum continues random simulation
until it stumbles upon another interesting state,
where it repeats bounded model checking. Ketchum became
the key technology behind Magellan™, one of
the most successful semi-formal hardware test generation
tools. MACE has similar dynamics, but the components
are very different. We use the L* [1] finite-state
machine inference algorithm to infer a high-level abstract 
model and declare all the states in the model as interest- 
ing, while Ketchum picks interesting states heuristically. 
While Ketchum uses random simulation, we drive the 
analyzed software to the interesting state by finding the 
shortest path in the abstract model. Ketchum explores the 
vicinity of interesting states via bounded model check- 
ing, while we start dynamic symbolic execution from the 
interesting state. 


3 Problem Definition and Overview 


We begin this section with the problem statement and a 
list of assumptions that we make in this paper. Next, we 
discuss possible applications of MACE. At the end of 
this section, we introduce the concepts and notation that 
will be used throughout the paper. 


3.1 Problem Statement 


We have three, mutually supporting, goals. First, we 
wish to automatically infer an abstract finite-state model 
of a program’s interaction with its environment, i.e., a 
protocol as implemented by the program. Second, once 
we infer the model, we wish to use it to guide a com- 
bination of concrete and symbolic execution in order to 
improve the state-space exploration. Third, if the explo- 
ration phase discovers new types of messages, we wish 
to refine the abstract model, and repeat the process. 
There are two ways to refine the abstract finite-state
model: by adding more states, and by adding more messages
to the state machine’s input (or output) alphabet,
which can result in inference of new transitions and
states. Black box inference algorithms, like L* [1], infer
a state machine over a fixed-size alphabet by iteratively
discovering new states. Such algorithms can be
used for the first type of refinement. Any traditional program
state-space exploration technique could be used to
discover new input (or output) messages, but adding all
the messages to the state machine’s alphabets would render
the inference computationally infeasible. Thus, we
also wish to find an effective way to reduce the size of
the alphabet, without missing states during the inference.


Figure 1: An Abstract Rendition of the MACE State-Space
Exploration. The figure on the left shows an
abstract model, i.e., a finite-state machine, inferred by
MACE. The figure on the right depicts clusters of concrete
states of the analyzed application, such that clusters
are abstracted with a single abstract state. We infer
the abstract model with L*, initialize the analyzed application
to the desired state, and then use the state-space
exploration component of MACE to explore the concrete
clusters of states.


The constructed abstract model can guide the search in 
many ways. The approach we take in this paper is to use 
the abstract model to generate a sequence of inputs that 
will drive the abstract model and the program to the de- 
sired state. After the program reaches the desired state, 
we explore the surrounding state-space using a combina- 
tion of symbolic and concrete execution. Through such 
exploration, we might visit numerous states that are all 
abstracted with a single state in the abstract model and 
discover new inputs that can refine the abstract model. 
Figure 1 illustrates the concept.

In our work, we make a few assumptions:


Determinism We assume the analyzed program’s com- 
munication with its environment is deterministic, 
i.e., the same sequence of inputs always leads 
to the same sequence of outputs and the same 
state. In practice, programs can exhibit some non- 
determinism, which we are abstracting away. For 
example, the same input message could produce 
two different outputs from the same state. In such 
a case, we put both output messages in the same 
equivalence class by adjusting our output abstrac- 
tion (see below). 


Resettability We assume the analyzed program can be 
easily reset to its initial state. The reset may be 
achieved by restarting the program, re-initializing 
its environment or variables, or simply initiating a 
new client connection. In practice, resetting a program
is usually straightforward, since we have complete
control of the program.


Output Abstraction Function We assume the exis- 
tence of an output abstraction function that ab- 
stracts concrete response (output) messages from 
the server into an abstract set of messages (alpha- 
bet) used for state machine inference. In practice, 
this assumption often reduces to manually identi- 
fying which sub-fields of output messages will be 
used to distinguish output message types. The out- 
put alphabet, in MACE, determines the granularity 
of abstraction. 


3.2 Applications 


The primary intended application of MACE is state- 
space exploration of programs communicating with their 
environment through a protocol, e.g., networked appli- 
cations. We use the inferred protocol state machine as a 
map that tells us how to quickly get to a particular part 
of the search-space. In comparison, model checking and 
dynamic symbolic execution approaches consider the ap- 
plication’s state-space flat, and do not attempt to exploit 
the structure in the state machine of the communication 
protocol through which the application communicates 
with the world. Other applications of MACE include 
proprietary protocol inference, extension of the existing 
protocol test suites, conformance checking of different 
protocol implementations, and fingerprinting of imple- 
mentation differences. 


3.3 Preliminaries


Following our prior work [10], we use Mealy machines 
[23] as abstract protocol models. Mealy machines are 
natural models of protocols because they specify transi- 
tion and output functions in terms of inputs. Mealy ma- 
chines are defined as follows: 


Definition 1 (Mealy Machine). A Mealy machine, $M$, is
a six-tuple $(Q, \Sigma_I, \Sigma_O, \delta, \lambda, q_0)$, where $Q$ is a finite
non-empty set of states, $q_0 \in Q$ is the initial state, $\Sigma_I$ is a
finite set of input symbols (i.e., the input alphabet), $\Sigma_O$ is
a finite set of output symbols (i.e., the output alphabet),
$\delta : Q \times \Sigma_I \to Q$ is the transition relation, and
$\lambda : Q \times \Sigma_I \to \Sigma_O$ is the output relation.


We extend the $\delta$ and $\lambda$ relations to sequences of
messages $m_j \in \Sigma_I$ as usual, e.g.,
$\delta(q, m_0 \cdot m_1 \cdot m_2) = \delta(\delta(\delta(q, m_0), m_1), m_2)$ and
$\lambda(q, m_0 \cdot m_1 \cdot m_2) = \lambda(q, m_0) \cdot \lambda(\delta(q, m_0), m_1) \cdot \lambda(\delta(q, m_0 \cdot m_1), m_2)$.
To denote sequences of input (resp. output) messages
we will use lower-case letters $s, t$ (resp. $o$). For
$s \in \Sigma_I^*$, $m \in \Sigma_I$, the length $|s|$ is defined inductively:
$|\epsilon| = 0$, $|s \cdot m| = |s| + 1$, where $\epsilon$ is the empty sequence.
The $j$-th message $m_j$ in the sequence $s = m_0 \cdot m_1 \cdots m_{n-1}$
will be referred to as $s_j$. We define the support function
sup as $\mathrm{sup}(s) = \{ s_j \mid 0 \le j < |s| \}$. If for some state
machine $M = (Q, \Sigma_I, \Sigma_O, \delta, \lambda, q_0)$ and some state $q \in Q$
there is $s \in \Sigma_I^*$ such that $\delta(q_0, s) = q$, we say there is a
path from $q_0$ to $q$, i.e., that $q$ is reachable from the initial
state, denoted $q_0 \xrightarrow{s} q$. Since L* infers minimal state
machines, all states in the abstract model are reachable.
In general, each state could be reachable by multiple
paths. For each state $q$, we (arbitrarily) pick one of the
shortest paths formed by a sequence of input messages
$s$, such that $q_0 \xrightarrow{s} q$, and call it a shortest transfer
sequence.
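
The definitions above can be made concrete with a small sketch. The Python below is illustrative only and is not the MACE implementation: the `Mealy` class, its `run` method, `shortest_transfer_sequences`, and the toy three-state machine are hypothetical names and data introduced here. It assumes a complete machine stored as dictionaries, extends $\delta$ and $\lambda$ to sequences, and picks one shortest transfer sequence per state by breadth-first search.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Mealy:
    """A complete Mealy machine (Q, Sigma_I, Sigma_O, delta, lambda, q0)."""
    states: frozenset
    inputs: frozenset
    delta: dict   # (state, input) -> next state
    lam: dict     # (state, input) -> output symbol
    q0: object

    def run(self, seq, q=None):
        """Extend delta and lambda to a sequence: return (final state, outputs)."""
        q = self.q0 if q is None else q
        outputs = []
        for m in seq:
            outputs.append(self.lam[(q, m)])
            q = self.delta[(q, m)]
        return q, tuple(outputs)


def shortest_transfer_sequences(machine):
    """Breadth-first search from q0: one shortest input sequence per reachable state."""
    seqs = {machine.q0: ()}
    queue = deque([machine.q0])
    while queue:
        q = queue.popleft()
        # Sorting only makes the (arbitrary) choice among equally short paths repeatable.
        for a in sorted(machine.inputs):
            nxt = machine.delta[(q, a)]
            if nxt not in seqs:
                seqs[nxt] = seqs[q] + (a,)
                queue.append(nxt)
    return seqs


# Hypothetical toy protocol, used only for illustration.
STATES = ("init", "auth", "ready")
INPUTS = ("hello", "login", "data")
# Transitions that change state; every other (state, input) pair is a self-loop.
MOVES = {("init", "hello"): "auth", ("auth", "login"): "ready"}
# Pairs answered with "ok"; every other pair is answered with "err".
OKS = {("init", "hello"), ("auth", "login"), ("ready", "data")}

toy = Mealy(
    states=frozenset(STATES),
    inputs=frozenset(INPUTS),
    delta={(q, a): MOVES.get((q, a), q) for q in STATES for a in INPUTS},
    lam={(q, a): "ok" if (q, a) in OKS else "err" for q in STATES for a in INPUTS},
    q0="init",
)
```

Calling `shortest_transfer_sequences(toy)` returns one input sequence per state; feeding that sequence to `toy.run` ends in the corresponding state, which is the property the exploration phase relies on.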

Our search process discovers numerous input and out- 
put messages, and using all of them for the model in- 
ference would not scale. Thus, we heuristically discard 
redundant input messages, defined as follows: 


Definition 2 (Redundant Input Symbols). Let $M =
(Q, \Sigma_I, \Sigma_O, \delta, \lambda, q_0)$ be a Mealy machine. A symbol $m \in
\Sigma_I$ is said to be redundant if there exists another symbol,
$m' \in \Sigma_I$, such that $m \ne m'$ and
$\forall q \in Q \,.\; \lambda(q, m) = \lambda(q, m') \wedge \delta(q, m) = \delta(q, m')$.
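
As a hedged illustration of Definition 2, the sketch below groups input symbols by their behavior in every state and reports all but one symbol of each group. The function name `redundant_inputs` and the dictionary encoding of $\delta$ and $\lambda$ are assumptions made here. Because the definition is symmetric (two equivalent symbols are each redundant with respect to the other), the sketch also assumes that the first symbol in sorted order is kept as the representative.

```python
def redundant_inputs(states, inputs, delta, lam):
    """Return input symbols whose (next state, output) behavior in every state
    duplicates that of an already-kept symbol (Definition 2)."""
    representative = {}   # behavior signature -> first symbol with that behavior
    redundant = set()
    ordered_states = sorted(states)
    for m in sorted(inputs):
        signature = tuple((delta[(q, m)], lam[(q, m)]) for q in ordered_states)
        if signature in representative:
            redundant.add(m)          # same delta and lambda in all states
        else:
            representative[signature] = m
    return redundant
```

Comparing whole-state signatures keeps the check linear in the number of symbols, rather than comparing every pair of symbols directly.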


We say that a Mealy machine $M = (Q, \Sigma_I, \Sigma_O, \delta, \lambda, q_0)$
is complete iff $\delta(q, i)$ and $\lambda(q, i)$ are defined for every
$q \in Q$ and $i \in \Sigma_I$. In this paper, we infer complete Mealy
machines. There is also another type of completeness:
the completeness of the input and output alphabet.
MACE cannot guarantee that the input alphabet is complete,
meaning that it might not discover some types of
messages required to infer the full state machine of the
protocol.


To infer Mealy machines, we use Shahbaz and Groz’s 
[26] variant of the classical L* [1] inference algorithm. 
We describe only the intuition behind L*, as the algo- 
rithm is well-described in the literature. 


L* is an online learning algorithm that proactively 
probes a black box with sequences of messages, listens to 
responses, and builds a finite state machine from the re- 
sponses. The black box is expected to answer the queries 
in a faithful (i.e., it is not supposed to cheat) and deter- 
ministic way. Each generated sequence starts from the 
initial state, meaning that L* has to reset the black box 
before sending each sequence. Once it converges, L* 
conjectures a state machine, but it has no way to ver- 
ify that it is equivalent to what the black box imple- 
ments. Three approaches to solving this problem have 
been described in the literature. The first approach is to 
assume an existence of an oracle capable of answering 
the equivalence queries. L* asks the oracle whether the 
conjectured state machine is equivalent to the one im- 
plemented by the black box, and the oracle responds ei- 
ther with ‘yes’ if the conjecture is equivalent, or with 
a counterexample, which L* uses to refine the learned 
state machine and make another conjecture. The pro- 
cess is guaranteed to terminate in time polynomial in 
the number of states and the size of the input alphabet. 
However, in practice, such an oracle is unavailable. The 
second approach is to generate random sampling queries 
and use those to test the equivalence between the con- 
jecture and the black box. If a sampling query discovers 
a mismatch between a conjecture and the black box, re- 
finement is done the same way as with the counterexam- 
ples that would be generated by equivalence queries. The 
sampling approach provides a probabilistic guarantee [1] 
on the accuracy of the inferred state machine. The third 
approach, called black box model checking [24], uses 
bounded model checking to compare the conjecture with 
the black box. 
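
The sampling approach can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used by any particular L* implementation: `sample_counterexample`, `run_hypothesis`, and `run_black_box` are hypothetical names, and both callables are assumed to reset to the initial state, consume one input sequence, and return the abstract output sequence.

```python
import random


def sample_counterexample(run_hypothesis, run_black_box, alphabet,
                          n_queries, length, seed=0):
    """Send random input sequences to both the conjecture and the black box.

    Returns the first sequence on which their outputs differ, or None if no
    mismatch is found within n_queries samples.
    """
    rng = random.Random(seed)          # seeded so that a run can be repeated
    symbols = sorted(alphabet)
    for _ in range(n_queries):
        query = tuple(rng.choice(symbols) for _ in range(length))
        if run_hypothesis(query) != run_black_box(query):
            return query               # usable as a counterexample for refinement
    return None
```

A `None` result does not prove equivalence; it only means that no sampled query distinguished the conjecture from the black box.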


As discussed in Section 3.1, MACE requires an output
message abstraction function $\alpha_O : \mathcal{M}_O \to \Sigma_O$, where
$\mathcal{M}_O$ is the set of all concrete output messages, that
abstracts concrete output messages into the abstract output
alphabet. However, unlike the prior work [10], MACE
requires no input abstraction function. We will extend
the output abstraction function to sequences as follows.
Let $o \in \mathcal{M}_O^*$ be a sequence of concrete output messages
such that $|o| = n$. The abstraction of a sequence is defined
as $\alpha_O(o) = \alpha_O(o_0) \cdots \alpha_O(o_{n-1})$.
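
A minimal sketch of an output abstraction function and its extension to sequences is given below. The wire format is an assumption made purely for illustration: it supposes that the first byte of a response carries the message type, which is not the layout of any protocol discussed in this paper. The names `alpha_o` and `alpha_o_seq` are hypothetical, and the empty response is mapped to an artificial no-response symbol.

```python
def alpha_o(message: bytes) -> str:
    """Hypothetical abstraction: keep only the first byte (assumed type field)."""
    if not message:
        return "NO_RESPONSE"           # artificial symbol for a missing response
    return f"T{message[0]}"


def alpha_o_seq(outputs):
    """Extend alpha_o to sequences by abstracting each message in order."""
    return tuple(alpha_o(m) for m in outputs)
```

Two responses that differ only in their payload map to the same abstract symbol, which is how the choice of fields fixes the granularity of the inferred model.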


Figure 2: The MACE Approach Diagram. The L* algorithm takes in the input and output alphabets, over which it
infers a state-machine. L* sends queries and receives responses from the analyzed application, which is not shown in
the figure. The result of inference is a finite-state machine (FSM). For every state in the inferred state machine, we
generate a shortest transfer sequence (Section 3.3) that reaches the desired state, starting from the initial state. Such
sequences are used to initialize the state-space explorer, which runs dynamic symbolic execution after the initialization.
The state-space explorers run the analyzed application (not shown) in parallel.


4 Model-inference-Assisted Concolic 
Exploration 


We begin this section with a high-level description of
MACE, illustrated in Figure 2. After the high-level
description, each subsection describes a major component of
MACE: abstract model inference, concrete state-space
exploration, and filtering of redundant concrete input
messages together with the abstract model refinement.


4.1 A High-Level Description 


Suppose we want to infer a complete Mealy machine
$M = (Q, \Sigma_I, \Sigma_O, \delta, \lambda, q_0)$ representing some protocol, as
implemented by the given program. We assume that we know
the output abstraction function $\alpha_O$ that abstracts concrete
output messages into $\Sigma_O$. To bootstrap MACE, we
also assume that we have an initial set $\Sigma_{I0} \subseteq \Sigma_I$ of input
messages, which can be extracted from a regression
test suite, collected by observing the communication of
the analyzed program with the environment, or obtained
from DART and similar approaches [17, 25, 8, 7]. The
initial $\Sigma_{I0}$ alphabet could be empty, but MACE would
take longer to converge. In our work, we used regression
test suites provided with the analyzed applications, or
extracted messages from a single observed communication
session if the test suite was not available.

Next, L* infers the first state machine $M_0 =
(Q_0, \Sigma_{I0}, \Sigma_O, \delta_0, \lambda_0, q_0^0)$ using $\Sigma_{I0}$ and $\Sigma_O$ as the abstract
alphabets. In $M_0$, we find a shortest transfer sequence
from $q_0^0$ to every state $q \in Q_0$. We use such sequences
to drive the program to one of the concrete states represented
by the abstract state $q$. Since each abstract state
could correspond to a large cluster of concrete states
(Fig. 1), we use dynamic symbolic execution to explore
the clusters of concrete states around abstract states.


The state-space exploration generates sequences of
concrete input and the corresponding output messages.
Using the output abstraction function $\alpha_O$, we can abstract
the concrete output message sequences into sequences
over $\Sigma_O^*$. However, we cannot abstract the concrete input
messages into a subset of $\Sigma_I$, as we do not have the
concrete input message abstraction function. Using all
the concrete input messages for the L*-based inference
would be computationally infeasible. The state-space
exploration discovers hundreds of thousands of concrete
messages, because we run the exploration phase for hundreds
of hours, and on average, it discovers several thousand
new concrete messages per hour.

Thus, we need a way to filter out redundant messages
and keep the ones that will allow L* to discover new
states. The filtering is done as follows. Suppose that $s$
is a sequence of concrete input messages generated from
the exploration phase and $o \in \Sigma_O^*$ a sequence of the corresponding
abstract output messages. If there exists $t \in \Sigma_{I0}^*$,
such that $M_0$ accepts $t$ generating $o$, we discard $s$. Otherwise,
at least one concrete message in the $s$ sequence
generates either a new state or a new transition, so we refine
the input alphabet and compute $\Sigma_{I1} = \Sigma_{I0} \cup \mathrm{sup}(s)$.

With the new abstract input alphabet $\Sigma_{I1}$, we infer a
new, more refined, abstract model $M_1$ and repeat the process.
If the number of messages is finite and either the
exploration phase terminates or runs for a predetermined
bounded amount of time, MACE terminates as well.
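
The iteration described in this section can be summarized as a control-flow skeleton. The sketch below is a simplification under several assumptions and is not the MACE implementation: `mace_loop` and its callable parameters `learn`, `transfer_sequences`, `explore`, and `keep` are hypothetical names standing in for L*, the shortest-transfer-sequence computation, the state-space explorer, and the message filter. It also assumes that abstract states carry identifiers that remain stable across iterations, so that already explored states can be skipped.

```python
def mace_loop(seed_inputs, learn, transfer_sequences, explore, keep, max_iters=10):
    """Alternate model inference and guided exploration until no new
    input messages survive filtering (or max_iters is reached)."""
    alphabet = set(seed_inputs)
    explored = set()                       # abstract states already explored
    model = learn(alphabet)                # initial model from the seed alphabet
    for _ in range(max_iters):
        new_symbols = set()
        for state, prefix in transfer_sequences(model).items():
            if state in explored:
                continue                   # explore only newly discovered states
            explored.add(state)
            # explore(prefix) yields (input sequence, abstract output sequence) pairs
            for s, o in explore(prefix):
                new_symbols |= keep(model, s, o)
        if new_symbols <= alphabet:
            break                          # nothing new: the alphabet is stable
        alphabet |= new_symbols
        model = learn(alphabet)            # refine the model over the larger alphabet
    return model, alphabet
```

The `max_iters` bound plays the role of the predetermined time bound mentioned above; a real deployment would bound wall-clock time instead.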


4.2 Model Inference with L*


MACE learns the abstract model of the analyzed program
by constructing sequences of input messages, sending
them to the program, and reasoning about the responses.
For the inference, we use Shahbaz and Groz’s
[26] variant of L* for learning Mealy machines. The inference
process is similar to that in our prior work [10].

In every iteration of MACE, L* infers a new state machine
over $\Sigma_{Ii}$ and the new messages discovered by the
state-space exploration guided by $M_i$, and conjectures
$M_{i+1}$, a refinement of $M_i$. Out of the three options for
checking conjectures discussed in Section 3.3, we chose 
to check conjectures using the sampling approach. We 
could use sampling after each iteration, but we rather 
defer it until the whole process terminates. In other 
words, rather than doing sampling after each iteration, 
we use the subsequent MACE iterations instead of the 
traditional sampling. Once the process terminates, we 
generate sampling queries, but in no experiment we per- 
formed did sampling discover any new states. 


4.3 The State-Space Exploration Phase 


We use the model inferred in Section 4.2 to guide the
state-space exploration. For every state $q^i \in Q_i$ of the
just inferred abstract model $M_i$, we compute a shortest
transfer sequence of input messages from the initial state
$q_0^i$. Suppose the computed sequence is $s \in \Sigma_{Ii}^*$. With
$s$, we drive the analyzed application to a concrete state
abstracted by the $q^i$ state in the abstract model. All messages
in $\mathrm{sup}(s)$ are concrete messages either from the set
of seed messages, or generated by previous state-space
exploration iterations. Thus, the process of driving the
analyzed application to the desired state consists of only
computing a shortest path in $M_i$ to the state, collecting
the input messages along the path $q_0^i \xrightarrow{s} q^i$, and feeding
that sequence of concrete messages into the application.

Once the application is in the desired state $q^i$, we
run dynamic symbolic execution from that state to explore
the surrounding concrete states (Figure 1). In other
words, the transfer sequence of input messages produces
a concrete run, which is then followed by symbolic execution
that computes the corresponding path-condition.
Once the path-condition is computed, dynamic symbolic
execution resumes its normal exploration. We bound
the time allotted to exploring the vicinity of every abstract
state. In every iteration, we explore only the newly
discovered states, i.e., $Q_i \setminus Q_{i-1}$. Re-exploring the same
states over and over would be unproductive.

Thanks to the abstract model, MACE can easily compute
the necessary input message permutations required
to reach any abstract model state, just by computing a
shortest path. On the other hand, approaches that com- 
bine concrete and symbolic execution have to negate 
multiple predicates and get the decision procedure to 
generate the required sequence of concrete input mes- 
sages to get to a particular state. MACE has more con- 
trol over this process, and our experimental results show 
that the increased control results in higher line coverage, 
deeper analysis, and more vulnerabilities found. 


4.4 Model Refinement 


The exploration phase described in Section 4.3 generates 
a large number (hundreds of thousands in our setting) of 
new concrete messages. Using all of them to refine the 
abstract model is both unrealistic, as inference is polyno- 
mial in the size of the alphabet, and redundant, as many 
messages are duplicates and belong to the same equiv- 
alence class. To reduce the number of input messages 
used for inference, Comparetti et al. [12] propose a message
clustering technique, while we used a handcrafted
abstraction function in our prior work. In this paper,
we take a different approach.

In the spirit of dynamic symbolic execution, the explo- 
ration phase solves the path-condition (using a decision 
procedure) to generate new concrete inputs, more pre- 
cisely, sequences of concrete input messages. During the 
concrete part of the exploration phase, such sequences 
of input messages are executed concretely, which gen- 
erates the corresponding sequence of output messages. 
We abstract the generated sequence of output messages
using $\alpha_O$. If the abstracted sequence can be generated
by the current abstract model, we discard the sequence,
otherwise we add all the corresponding concrete input
messages to $\Sigma_{Ii}$. We define this process more formally:


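Definition 3 (Filter Function). Let $\mathcal{M}_I$ (resp. $\mathcal{M}_O$) be
a (possibly infinite) set of all possible concrete input
(resp. output) messages. Let $s \in \mathcal{M}_I^*$ (resp. $o \in \mathcal{M}_O^*$)
be a sequence of concrete input (resp. output) messages
such that $|s| = |o|$. We assume that each input message
$s_j$ produces $o_j$ as a response. Let $M_i \in \mathcal{A}$ be the abstract
model inferred in the last iteration and $\mathcal{A}$ the universe
of all possible Mealy machines. The filter function
$f : \mathcal{A} \times \mathcal{M}_I^* \times \mathcal{M}_O^* \to 2^{\mathcal{M}_I}$ is defined as follows:

$$
f(M_i, s, o) =
\begin{cases}
\emptyset & \text{if } \exists t \in \Sigma_{Ii}^* \,.\; \lambda_i(t) = \alpha_O(o) \\
\mathrm{sup}(s) & \text{otherwise}
\end{cases}
$$

Here $\lambda_i(t)$ denotes the output sequence that $M_i$ produces
on $t$ when started from its initial state.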

In practice, a single input message could produce ei- 
ther no response or multiple output messages. In the 
first case, our implementation generates an artificial no- 
response message, and in the second case, it picks the 
first produced output message. A more advanced implementation
could infer a subsequential transducer [28],
instead of a finite-state machine. A subsequential trans- 
ducer can transduce a single input into multiple output 
messages. 

Once the exploration phase is done, we apply the filter
function to all newly found input and output sequences
$s_j$ and $o_j$, and refine the alphabet $\Sigma_{Ii}$ by adding the messages
returned by the filter function. More precisely:

$$
\Sigma_{I(i+1)} = \Sigma_{Ii} \cup \bigcup_j f(M_i, s_j, o_j)
$$

In the next iteration, L* learns a new model $M_{i+1}$, a
refinement of $M_i$, over the refined alphabet $\Sigma_{I(i+1)}$.
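
The filter function and the alphabet refinement step can be sketched as follows. This is an illustrative reading of Definition 3 rather than the MACE code: the names `can_generate`, `filter_inputs`, and `refine_alphabet`, and the tuple encoding `(inputs, delta, lam, q0)` of the model, are assumptions made here. The existential check over input sequences is done by tracking the set of states the model could be in after each abstract output.

```python
def can_generate(inputs, delta, lam, q0, abstract_outputs):
    """True if some input sequence makes the model emit abstract_outputs from q0."""
    frontier = {q0}                        # states reachable while matching the outputs
    for out in abstract_outputs:
        frontier = {delta[(q, a)]
                    for q in frontier for a in inputs
                    if lam[(q, a)] == out}
        if not frontier:
            return False                   # no input sequence explains this output
    return True


def filter_inputs(model, s, o, alpha_o):
    """Definition 3: the empty set if the model already explains alpha_o(o), else sup(s)."""
    abstract = tuple(alpha_o(m) for m in o)
    return set() if can_generate(*model, abstract) else set(s)


def refine_alphabet(alphabet, model, traces, alpha_o):
    """Union the current alphabet with the filtered inputs of every (s, o) trace."""
    refined = set(alphabet)
    for s, o in traces:
        refined |= filter_inputs(model, s, o, alpha_o)
    return refined
```

The frontier never holds more states than the model has, so the check stays cheap even when the set of candidate input sequences is infinite.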


5 Implementation 


In this section, we describe our implementation of 
MACE. The L* component sends queries to and collects 
responses from the analyzed server, and thus can be seen 
as a client sending queries to the server and listening to 
the corresponding responses. Section 5.1 explains this 
interaction in more detail. Section 5.2 surveys the main 
model inference optimizations, including parallelization, 
caching, and filtering. Finally, Section 5.3 introduces our 
state-space exploration component, which is used as a 
baseline for the later provided experimental results. 


5.1 L* as a Client


Our implementation of L* infers the protocol state machine
over the concrete input and abstract output messages.
As a client, L* first resets the server, by clearing its
environment variables and resetting it to the initial state,
and then sends the concrete input message sequences directly
to the server.

Servers have a large degree of freedom in how quickly
they want to reply to the queries, which introduces
non-deterministic latency that we want to avoid. For one
server application we analyzed (Vino), we had to slightly
modify the server code to ensure synchronous responses.
We wrote wrappers around the poll and read system
calls that immediately respond to L*’s queries, modifying
eight lines of code in Vino.


5.2 Model Inference Optimizations 


We have implemented the L* algorithm with distributed 
master-worker parallelization of queries. L* runs in the 
master node, and distributes its queries among the worker 
nodes. The worker nodes compute the query responses, 
by sending the input sequences to the server, collecting
and abstracting responses, and sending them back to L*. 

Since model refinement requires L* to make repeated 
queries across iterations, we maintain a cache to avoid 
re-computing responses to the previously seen queries. 
L* looks up the input in the cache before sending queries 
to worker nodes. 
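
The interplay between the cache and the workers can be illustrated with a small sketch. It substitutes a local thread pool for the distributed master-worker setup described above, so it shows only the control flow and not the actual deployment. The names `answer_queries`, `run_query`, and `workers` are hypothetical, and the cache is assumed to be a plain dictionary keyed by input sequence.

```python
from concurrent.futures import ThreadPoolExecutor


def answer_queries(queries, run_query, cache, workers=4):
    """Answer each query from the cache when possible; otherwise dispatch it
    to a worker, store the response, and return all responses in query order."""
    # dict.fromkeys removes duplicate queries while preserving their order.
    missing = [q for q in dict.fromkeys(queries) if q not in cache]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for q, response in zip(missing, pool.map(run_query, missing)):
            cache[q] = response
    return [cache[q] for q in queries]
```

Here `run_query` stands for the worker-side routine that resets the server, sends one input sequence, and returns the abstracted responses.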

As L*’s queries could trigger bugs in the server appli- 
cation, responses could be inconsistent. For example, if 
L* emits two sequences of input messages, s and t, such 
that s is a prefix of t, then the response to s should be a 
prefix of the response to t. Before adding an input-output 
sequence pair to the cache, we check that all the prefixes 
are consistent with the newly added pair, and report a 
warning if they are inconsistent. 
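
The prefix-consistency check can be sketched as below. The class name `QueryCache` and its methods are hypothetical, and the sketch makes two assumptions that the text does not state: each input message yields exactly one output message, and an inconsistent pair is still stored after the warning is reported.

```python
import warnings


class QueryCache:
    """Cache of query responses with a prefix-consistency check on insertion."""

    def __init__(self):
        self._responses = {}     # input sequence (tuple) -> output sequence (tuple)

    def lookup(self, inputs):
        """Return the cached outputs for this input sequence, or None."""
        return self._responses.get(tuple(inputs))

    def add(self, inputs, outputs):
        """Store a pair; return False (and warn) if it contradicts a cached pair."""
        inputs, outputs = tuple(inputs), tuple(outputs)
        consistent = True
        for old_in, old_out in self._responses.items():
            n = min(len(old_in), len(inputs))
            # One sequence is a prefix of the other iff they agree on the first n inputs.
            if old_in[:n] == inputs[:n] and old_out[:n] != outputs[:n]:
                warnings.warn(f"inconsistent responses for inputs {inputs[:n]!r}")
                consistent = False
        self._responses[inputs] = outputs
        return consistent
```

The linear scan over cached entries keeps the sketch short; a trie keyed by input messages would make the same check proportional to the query length.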

After each inference iteration, we analyze the state 
machine to find redundant messages (Definition 2) and 
discard them. This is a simple, but effective, optimiza- 
tion that reduces the load on the subsequent MACE it- 
erations. This optimization is especially important for 
inferring the initial state machine from the seed inputs. 


5.3 State-Space Exploration 


Our implementation of the state-space exploration con- 
sists of two components: a shortest transfer sequence 
generator and the state-space explorer. A shortest trans- 
fer sequence generator is implemented through a simple 
modification of the L* algorithm. The algorithm main- 
tains a data structure (called observation table [1]) that 
contains a set of shortest transfer sequences, one for each 
inferred state. We modify the algorithm to output this 
set together with the final model. MACE uses sequences 
from the set to launch and initialize state-space explorers. 

Our state-space explorer uses a combination of dy- 
namic and symbolic execution [17, 25, 8, 7]. The imple- 
mentation consists of a system emulator, an input gener- 
ator, and a priority queue. The system emulator collects 
execution traces of the analyzed program with respect 
to given concrete inputs. Given a collected trace, the 
input generator performs symbolic execution along the 
traced path, computes the path-condition, modifies the 
path condition by negating predicates, and uses a deci- 
sion procedure to solve the modified path condition and 
to generate new inputs that explore different execution 
paths. The generated inputs are then provided back to the 
system emulator and the exploration continues. We use 
the priority queue, like [18], to prioritize concrete traces 
that are used for symbolic execution. The traces that visit 
a larger number of new basic blocks, unexplored by the 
prior traces, have higher priority. 
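
The trace prioritization can be sketched with a binary heap. The class `TraceQueue` is a hypothetical illustration and not the explorer's actual data structure; it assumes that a trace is summarized by the set of basic-block identifiers it visits and that the priority is fixed when the trace is pushed.

```python
import heapq
import itertools


class TraceQueue:
    """Priority queue that favors traces visiting more previously unseen basic blocks."""

    def __init__(self):
        self._heap = []
        self._covered = set()            # basic blocks seen by earlier traces
        self._tie = itertools.count()    # insertion order breaks ties

    def push(self, trace_id, blocks):
        blocks = set(blocks)
        new_blocks = len(blocks - self._covered)
        self._covered |= blocks
        # heapq is a min-heap, so the count is negated to pop the largest first.
        heapq.heappush(self._heap, (-new_blocks, next(self._tie), trace_id))

    def pop(self):
        """Return the identifier of the highest-priority trace."""
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

A trace that only revisits covered blocks receives priority zero and is therefore processed after every trace that contributed new coverage.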

The system emulator provides the capability to save
and restore program snapshots. To perform model-assisted
exploration from a desired state in the model,
we first set the program state to the snapshot of the ini- 
tial state. Then, we drive the program to the desired state 
using the corresponding shortest transfer sequence, and 
start dynamic symbolic execution from that state. 

In all our experiments, we used the snapshot capability 
to skip the server boot process. More precisely, we boot 
the server, make a snapshot, and run all the experiments 
on the snapshot. We do not report the code executed dur- 
ing the boot in the line coverage results. 


6 Evaluation 


To evaluate MACE, we infer server-side models of two 
widely-deployed network protocols: Remote Frame- 
buffer (RFB) and Server Message Block (SMB). The 
RFB protocol is widely used in remote desktop appli- 
cations, including GNOME Vino and RealVNC. Mi- 
crosoft’s SMB protocol provides file and printer shar- 
ing between Windows clients and servers. Although the 
SMB protocol is proprietary, it was reverse-engineered 
and re-implemented as an open-source system, called 
Samba. Samba allows interoperability between Win- 
dows and Unix/Linux-based systems. In our experi- 
ments, we use Vino 2.26.1 and Samba 3.3.4 as reference 
implementations to infer the protocol models of RFB and 
SMB respectively. We discuss the result of our model in- 
ference in Section 6.2. 

Once we infer the protocol model from one reference 
implementation, we can use it to guide state-space ex- 
ploration of other implementations of the same proto- 
col. Using this approach, we analyze RealVNC 4.1.2 
and Windows XP SMB, without re-inferring the proto- 
col state machine. 

MACE found a number of critical vulnerabilities, 
which we discuss in Section 6.3. In Section 6.4, we eval- 
uate the effectiveness of MACE, by comparing it to the 
baseline state-space exploration component of MACE 
without guidance. 


6.1 Experimental Setup 


For our state-space exploration experiments, we used the
DETER Security testbed [4] comprised of 3GHz Intel
Xeon processors. For running L* and the message filtering,
we used a few slower 2.27GHz Intel Xeon machines.
When comparing MACE against the baseline approach,
we sum the inference and the state-space exploration
time taken by MACE, and compare it to running
the baseline approach for the same amount of time. This
setup gives a slight advantage to the baseline approach
because inference was done on slower machines, but our
experiments still show MACE is significantly superior,
in terms of achieved coverage, found vulnerabilities and
exploration depth.


Vino is the default remote desktop application in GNOME
distributions; RealVNC reports over 100 million downloads
(http://www.realvnc.com).


Program (Protocol) | Iter. | $|Q|$ | $|\Sigma_I|$ | $|\Sigma_O|$ | Tot. Learning Time (min)
Vino (RFB)         |       |       |              |              | 142
Vino (RFB)         |       |       |              |              | 8
Samba (SMB)        |       |       |              |              | 2028
Samba (SMB)        |       |       |              |              | 1840
Samba (SMB)        |       |       |              |              | 307

Table 1: Model Inference Result at the End of Each Iteration.
The second column identifies the inference iteration.
The $Q$ column denotes the number of states in the
inferred model. The $\Sigma_I$ (resp. $\Sigma_O$) column denotes the
size of the input (resp. output) alphabet. The last column
gives the total time (sum of all parallel jobs together)
required for learning the model in each iteration, including
the message filtering time. The learning process is incremental,
so later iterations can take less time, as the older
conjecture might need a small amount of refinement.


6.2 Model Inference and Refinement 


We used MACE to iteratively infer and refine the pro- 
tocol models of RFB and SMB, using Vino 2.26.1 and 
Samba 3.3.4 as reference implementations respectively. 
Table 1 shows the results of iterative model inference and 
refinement on Vino and Samba. 

As discussed in Section 4.2, once MACE terminates, 
we check the final inferred model with sampling queries. 
We used 1000 random sampling queries composed of 40 
input messages each, and tried to refine the state machine 
beyond what MACE inferred. The sampling did not dis- 
cover any new state in any experiment we performed. 

Vino. For Vino, we collected a 45-second network
trace of a remote desktop session, using krdc (KDE Remote
Desktop Connection) as the client. During this session,
the Vino server received a total of 659 incoming
packets, which were considered as seed messages. For
abstracting the output messages, we used the message
type and the encoding type of the outbound packets from
the server. MACE inferred the initial model consisting of
seven states, and filtered out all but 8 input and 7 output
messages, as shown in Figure 3a.


(a) Original Vino’s RFB Model Based on Observed Live Traffic.
(b) Final Vino’s RFB Model Inferred by MACE.

Figure 3: Model Inference of Vino’s RFB protocol. States in which MACE discovers vulnerabilities are shown in grey.
The edge labels show the list of input messages and the corresponding output message separated by the ‘/’ symbol.
The explanations of the state and input and output message encodings are in Figure 4.

Using the initial inferred RFB protocol model, the 
state-space explorer component of MACE discovered 4 
new input messages and refined the model with new 
edges without adding new states (Figure 3b). We manu- 
ally inspected the newly discovered output message (la- 
bel R6 in Figure 3b) and found that it represents an out- 
going message type not seen in the initial model. 

Since MACE found no new states that could be ex- 
plored with the state-space explorer, the process termi- 
nated. Through manual comparison with the RFB pro- 
tocol specification, we found that MACE has discovered 
all the input messages and all the states, except the states 
related to authentication and encryption, both of which 
we disabled in our experiments. Further, MACE found 
all the responses to client’s queries. 

We also performed an experiment with authentication 
enabled (encryption was still disabled). With this con- 
figuration, MACE discovered only three states, because 
it was not able to get past the checksum used during au- 
thentication, but discovered an infinite loop vulnerability 
that can be exploited for denial-of-service attacks. Due 
to space limits, we do not report the detailed results from 
this experiment, only detail the vulnerability found. 

Samba. For Samba, we collected a network trace 
of multiple SMB sessions, using Samba’s gentest test 


There are two other output message types that are triggered by the 
server’s GUI events and thus are outside of our scope. 
suite, which generates random SMB operations for testing
SMB servers. We used the default gentest configuration,
with the default random number generator seeds.
To abstract the outbound messages from the server, we 
used the SMB message type and status code fields; er- 
ror messages were abstracted into a single error message 
type. The Samba server received a total of 115 input mes- 
sages, from which MACE inferred an initial SMB model 
with 40 states, with 40 input and 14 output messages (af- 
ter filtering out redundant messages). 

In the second iteration, MACE discovered 14 new in- 
put and 10 new output messages and refined the initial 
model from 40 states to 84 states. The model converged 
in the third iteration after adding a new input and a new 
output message without adding new states. Table 1 summarizes
all three inference rounds.

Manually analyzing the inferred state machine, we 
found that some of the discovered input messages have 
the same type, but different parameters, and therefore 
have different effects on the server (and different roles 
in the protocol). MACE discovered all the 67 message 
types used in Samba, but the concrete messages gener- 
ated by the decision procedure during the state-space ex- 
ploration phase often had invalid message parameters, so 
the server would simply respond with an error. Such re- 
sponses do not refine the model and are filtered out dur- 
ing model inference. In total, MACE was successful at 


http://samba.org/~tridge/samba-testing/ 
Label | Description
1     | client’s protocol version
2     | byte 0x01 (securityType=None, clientInit)
3     | setPixelFormat message
4     | setEncodings message
5     | frameBufferUpdateRequest message
6     | keyEvent message
7     | pointer event message
8     | clientCutText message
9     | byte 0x22
10    | malformed client’s protocol version
11    | frameBufferUpdateRequest message with bpp=8 and true-color=false
12    | malformed client’s protocol version

(a) Input Legend.


Label | Description
R1    | server’s protocol version
R2    | server’s supported security types
R3    | serverInit message
R4    | framebufferUpdate message with default encoding
R5    | framebufferUpdate message with alternative encoding
R6    | setColourMapEntries message
N     | no explicit reply from server
T     | socket closed by server

(b) Output Legend.


Figure 4: Explanation of States and Input/Output Messages of the State Machine from Figure 3.


pairing message types with parameters for 23 (out of 67) 
message types, which is an improvement of 10 message 
types over the test suite, which exercises only 13 differ- 
ent message types. 

We identified several causes of incompleteness in mes- 
sage discovery. First, message validity is configuration 
dependent. For example, the spoolopen, spoolwrite, 
spoolclose and spoolreturnqueue message types 
need an attached printer to be deemed valid. Our experi- 
mental setup did not emulate the complete environment, 
precluding us from discovering some message types. 
Second, a single echo message type generated by MACE 
induced the server to behave inconsistently and we dis- 
carded it due to our determinism requirement. Although 
this is likely a bug in Samba, this behavior is not reliably 
reproducible. We exclude this potential bug from the vul- 
nerability reports that we provide later. Third, our infras- 
tructure is unable to analyze the system calls and other 
code executed in the kernel space. In effect, the com- 
puted symbolic constraints are underconstrained. Thus, 
some corner-cases, like a specific combination of the 
message type and parameter (e.g., a specific file name), 
might be difficult to generate. This is a general problem 
when the symbolic formula computed by symbolic execution
is underconstrained.

In our experiments, we used Samba’s default configu- 
ration, in which encryption is disabled. The SMB proto- 
col allows null-authentication sessions with empty pass- 
word, similar to anonymous FTP. Thus, authentication 
posed no problems for MACE. 

MACE converged relatively quickly in both Vino and
Samba experiments (in three iterations or fewer). We attribute
this mainly to the granularity of abstraction. A
finer-grained model would require more rounds to infer.
The granularity of abstraction is determined by the output
abstraction function (Section 3.1).


6.3 Discovered Vulnerabilities


We use the inferred models to guide the state-space ex- 
ploration of implementations of the inferred protocol. 
After each inference iteration, we count the number of 
newly discovered states, generate shortest transfer se- 
quences (Section 3.3) for those states, initialize the server 
with a shortest transfer sequence to the desired (newly 
discovered) state, and then run 2.5 hours of state-space 
exploration in parallel for each newly discovered state. 
The input messages discovered during those 2.5 hours 
of state-space exploration per state are then filtered and 
used for refining the model (Section 4.4). For the base- 
line dynamic symbolic execution without model guid- 
ance, we run |Q| parallel jobs with different random 
seeds for each job for 15 hours, where |Q| is the num- 
ber of states in the final converged model inferred for the 
target protocol. Different random seeds are important, 
as they assure that each baseline job explores different 
trajectories within the program. 
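
Computing a shortest transfer sequence for each state can be sketched with a breadth-first search over the inferred state machine. This is a minimal sketch, assuming the model is stored as a dictionary mapping `(state, input)` to `(next_state, output)`. That encoding and the function name are hypothetical and not taken from MACE.

```python
from collections import deque

def shortest_transfer_sequences(transitions, start):
    """Breadth-first search over a Mealy machine.

    transitions: {(state, input): (next_state, output)}
    Returns {state: shortest list of inputs leading from start to state}.
    """
    adjacency = {}
    for (src, symbol), (dst, _output) in transitions.items():
        adjacency.setdefault(src, []).append((symbol, dst))

    sequences = {start: []}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for symbol, nxt in adjacency.get(state, []):
            if nxt not in sequences:  # first visit is a shortest path in BFS
                sequences[nxt] = sequences[state] + [symbol]
                queue.append(nxt)
    return sequences
```

When several shortest sequences reach the same state, this sketch keeps whichever one the search encounters first.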

We rely upon the operating system runtime error de- 
tection to detect vulnerabilities, but other detectors, like 
Valgrind, could be used as well. Once MACE detects 
a vulnerability, it generates an input sequence required 
for reproducing the problem. When analyzing Linux ap- 
plications, MACE reports a vulnerability when any of the 
critical exceptions (SIGILL, SIGTRAP, SIGBUS, SIGFPE, 
and SIGSEGV) is detected. For Windows programs, 


http://valgrind.org/ 
a vulnerability is found when MACE traps a call to
ntdll.dll::KiUserExceptionDispatcher and the
value of the first function argument represents one of the 
critical exception codes. 
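
On Linux, the signal-based check can be illustrated with a small helper that classifies how a monitored server process ended. This is a minimal sketch, assuming the server runs as a child of a Python harness built on the `subprocess` module, which reports a child killed by signal N as return code -N on POSIX. The harness and the function name are hypothetical and do not describe MACE's actual detector.

```python
import signal

# Signals treated as critical in the text above.
CRITICAL_SIGNALS = {signal.SIGILL, signal.SIGTRAP, signal.SIGBUS,
                    signal.SIGFPE, signal.SIGSEGV}

def critical_signal(returncode):
    """Map a subprocess return code to a critical signal name, or None.

    A negative return code -N means the child was terminated by signal N.
    """
    if returncode is None or returncode >= 0:
        return None  # still running, or exited normally
    try:
        sig = signal.Signals(-returncode)
    except ValueError:
        return None  # not a known signal number
    return sig.name if sig in CRITICAL_SIGNALS else None
```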

MACE found a total of seven vulnerabilities in Vino 
2.26.1, RealVNC 4.1.2, and Samba 3.3.4, within 2.5 
hours of state-space exploration per state. In comparison,
the baseline dynamic symbolic execution without
model guidance found only one of those vulnerabilities
(the least critical one), even when given the equivalent 
of 15 hours per state. Four of the vulnerabilities MACE 
found are new and also present in the latest version of the 
software at the time of writing. The list of vulnerabilities 
is shown in Table 2. The rest of this section provides a 
brief description of each vulnerability. 

Vino. MACE found three vulnerabilities in Vino; all 
of them are new. The first one (CVE-2011-0904) is 
an out-of-bounds read from arbitrary memory locations. 
When a certain type of the RFB message is received, 
the Vino server parses the message and later uses two 
of the message value fields to compute an unsanitized 
array index to read from. A remote attacker can craft 
a malicious RFB message with a very large value for 
one of the fields and exploit a target host running Vino. 
The Gnome project labeled this vulnerability with the 
“Blocker” severity (bug 641802), which is the highest 
severity in their ranking system, meaning that it must 
be fixed in the next release. MACE found this vulner- 
ability after 122 minutes of exploration per state, in the 
first iteration (when the inferred state machine has seven 
states, Table 1). The second vulnerability (CVE-2011- 
0905) is an out-of-bounds read due to a similar usage 
of unsanitized array indices; the Gnome project labeled 
this vulnerability (bug 641803) as “Critical”, the second 
highest problem severity. This vulnerability is marked 
as a duplicate of CVE-2011-0904, for it can be fixed by 
patching the same point in the code. However, these two 
vulnerabilities are reached through different paths in the 
finite-state machine model and the out-of-bounds read 
happens in different functions. These two vulnerabilities 
are actually located in a library used by not only Vino, 
but also a few other programs. According to the Debian security
tracker, kdenetwork 4:3.5.10-2 is also vulnerable.

The third vulnerability (CVE-2011-0906) is an infinite 
loop, found in the configuration with authentication en- 
abled. The problem appears when the Vino server re- 
ceives an authentication input from the client larger than 
the authentication checksum length that it expects. When 
the authentication fails, the server closes the client con- 
nection, but leaves the remaining data in the input buffer 


http://security-tracker.debian.org/tracker/CVE-2011-0904
queue. It also enters a deferred-authentication state
where all subsequent data from the client is ignored. This 
causes an infinite loop where the server keeps receiv- 
ing callbacks to process inputs that it does not process 
in deferred-authentication state. The server gets stuck in 
the infinite loop and stops responding, so we classify this 
vulnerability as a denial-of-service vulnerability. Unlike 
all other discovered vulnerabilities, we discovered this 
one when L* hung, rather than by catching signals or
trapping the exception dispatcher. Currently, we have no 
way of detecting this vulnerability with the baseline, so 
we do not report the baseline results for CVE-2011-0906. 

Samba. MACE found 3 vulnerabilities in Samba. The 
first two vulnerabilities have been previously reported 
and are fixed in the latest version of Samba. One of 
them (CVE-2010-1642) is an out-of-bounds read caused 
by the usage of an unsanitized Security_Blob_Length 
field in SMB’s Session_Setup_AndX message. The other 
(CVE-2010-2063) is caused by the usage of an unsani- 
tized field in the “Extra byte parameters” part of an SMB 
Logoff_AndX message. The third one is a null pointer 
dereference caused by an unsanitized Byte_Count field 
in the Session_Setup_AndX request message of the SMB
protocol. To the best of our knowledge, this vulnerability 
has never been publicly reported but has been fixed in the 
latest release of Samba. We did not know about any of 
these vulnerabilities prior to our experiments. 

RealVNC. MACE found a new critical out-of-bounds 
write vulnerability in RealVNC. One type of the RFB 
message processed by RealVNC contains a length field. 
The RealVNC server parses the message and uses the 
length field as an index to access the process memory 
without performing any sanitization, causing an out-of- 
bounds write. 

Win XP SMB. The implementation of Win SMB is 
partially embedded into the kernel, and currently our dy- 
namic symbolic execution system does not handle the 
kernel operating system mode. Thus, we were able to 
explore only the user-space components that participate 
in handling SMB requests. Further, we found that many 
involved components seem to serve multiple purposes, 
not only handling SMB requests, which makes their ex- 
ploration more difficult. We found no vulnerabilities in 
Win XP SMB. 


6.4 Comparison with the Baseline 


We ran several experiments to illustrate the improvement 
of MACE over the baseline dynamic symbolic execution 
approach. First, we measured the instruction coverage 
of MACE on the analyzed programs and compared it 
Program    | Vulnerability Type   | Disclosure ID  | Iter. | Jobs (|Q|) | MACE per job (min) | MACE total (hrs) | Baseline per job (min) | Baseline total (hrs)
Vino       | Wild read (blocker)  | CVE-2011-0904  | 1/2   | 7          | 122                | 15               | >900                   | >105
Vino       | Out-of-bounds read   | CVE-2011-0905  | 1/2   | 7          | 31                 | 4                | >900                   | >105
Vino       | Infinite loop        | CVE-2011-0906+ | 1/2   | 7          | 1                  | 1                | N/A                    | N/A
Samba      | Buffer overflow      | CVE-2010-2063  | 1/3   | 84         | 88                 | 124              | >900                   | >1260
Samba      | Out-of-bounds read   | CVE-2010-1642  | 1/3   | 84         | 10                 | 14               | >900                   | >1260
Samba      | Null-ptr dereference | Fixed w/o CVE  | 1/3   | 84         | 8                  | 12               | 430                    | 602
RealVNC    | Out-of-bounds write  | CVE-2011-0907  | 1/1   | 7          | 17                 | 2                | >900                   | >105
Win XP SMB | None                 | None           | None  | 84         | >150               | >210             | >900                   | >105
Table 2: Description of the Found Vulnerabilities. The upper half of the table (Vino and Samba) contains results for the reference
implementations from which the protocol model was inferred, while the bottom half (RealVNC and Win XP SMB) contains the
results for the other implementations that were explored using the inferred model (from Vino and Samba). The disclosure column
lists Common Vulnerabilities and Exposures (CVE) numbers assigned to vulnerabilities MACE found. The new vulnerabilities
are italicized. The + symbol denotes a vulnerability that could not have been detected by the baseline approach, because it lacks
a detector that would register non-termination. We found it with MACE, because it caused L* to hang. The “Iter.” column lists
the iteration in which the vulnerability was found and the total number of iterations. The “Jobs” column contains the total number
of parallel state-space exploration jobs. The number of jobs is equal to the number of states in the final converged inferred state
machine. The baseline experiment was done with the same number of jobs running in parallel as the MACE experiment. The
MACE column shows how much time passed before at least one parallel state-space exploration job reported the vulnerability and
the total runtime (number of jobs × time to the first report) of all the jobs up to that point. The “Baseline” column shows runtimes
for the baseline dynamic symbolic execution without model guidance. We set the timeout for the MACE experiment to 2.5 hours
per job. The baseline approach found only one vulnerability, even when allowed to run for 15 hours (per job). The >t entries mean
that the vulnerability was not found within time t.
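
The relation between the per-job and total columns of Table 2 (number of jobs × time to the first report) can be checked with a few lines of arithmetic. This is a minimal sketch. The function name is hypothetical, and the rounding up to whole hours is an assumption inferred from the reported values rather than something the caption states.

```python
def total_hours(jobs, minutes_per_job):
    """Total runtime of all parallel jobs in hours, rounded up to a whole hour."""
    total_minutes = jobs * minutes_per_job
    return (total_minutes + 59) // 60  # integer ceiling of total_minutes / 60

# CVE-2011-0904: 7 jobs, first report after 122 minutes.
print(total_hours(7, 122))
# Samba null-pointer dereference found by the baseline: 84 jobs, 430 minutes.
print(total_hours(84, 430))
```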












Program (Protocol) | Sequential Time (min) | Instruction Coverage: Baseline | MACE   | Improvement | Total (Unique) crashes: Baseline | MACE
Vino (RFB)         | 1200                  | 129762                         | 138232 | 6.53%       | 0 (0)                            | [...] (2)
Samba (SMB)        | 16775                 | [...]                          | [...]  | [...]       | [...]                            | [...]
RealVNC (RFB)      | 1200                  | 39300                          | 47557  | 21.01%      | 0 (0)                            | 7 (2)
Win XP (SMB)+      | 16775                 | 90431                          | 112820 | 24.76%      | 0 (0)                            | 0 (0)


Table 3: Instruction Coverage Results. The table shows the instruction coverage (number of unique executed instruction addresses)
of MACE after 2.5 hours of exploration per state in the final converged inferred state machine, and the baseline dynamic symbolic
execution given the amount of time equivalent to (time MACE required for inferring the final state machine + number of states in
the final state machine × 2.5 hours), shown in the second column. For example, from Table 1, we can see that Samba inference
took a total of 2028 + 1840 + 307 = 4175 minutes and produced an 84-state model. Thus, the baseline approach was given
84 × 150 + 4175 = 16775 minutes to run. The last two columns show the total number of crashes each approach found, and the
number of unique crashes according to the location of the crash in parentheses. Due to a limitation of our implementation of the
state-space exploration (user-mode only), the baseline result for Windows XP SMB (marked +) was so abysmal that comparing to
the baseline would be unfair. Thus, we compute the Win XP SMB baseline coverage by running Samba’s gentest test suite.
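
The time budget in the caption and the percentage column of Table 3 can be reproduced with the short sketch below. The function names are hypothetical. Reading the percentage as the relative coverage increase of MACE over the baseline is an assumption, although it agrees with the reported numbers.

```python
def baseline_budget_minutes(inference_minutes, num_states, minutes_per_state=150):
    """Baseline time budget: total inference time plus 2.5 hours per final state."""
    return sum(inference_minutes) + num_states * minutes_per_state

def coverage_improvement_percent(baseline, mace):
    """Relative increase in unique instruction addresses, in percent."""
    return round(100.0 * (mace - baseline) / baseline, 2)

# Samba: three inference iterations and an 84-state final model.
print(baseline_budget_minutes([2028, 1840, 307], 84))
# Vino: baseline and MACE coverage as reported in Table 3.
print(coverage_improvement_percent(129762, 138232))
```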
against the baseline coverage. Second, we compared the
number of crashes detected by MACE and by the base- 
line approach over the same amount of time. This num- 
ber provides an indication of how diverse the execution 
paths discovered by each approach are: more crashes imply
more diverse searched paths. Finally, we compared
the effectiveness of MACE and the baseline approach to 
reach deep states in the final inferred model. 


Instruction Coverage. In this experiment, we mea- 
sured the numbers of unique instruction addresses (i.e., 
EIP values) of the program binary and its libraries cov- 
ered by MACE and the baseline approach. These num- 
bers show how effective the approaches are at uncov- 
ering new code regions in the analyzed program. For 
Vino, RealVNC, and Samba, we used dynamic symbolic 
execution as the baseline approach and ran the experi- 
ment using the setup outlined in Section 6.1. We ran 
MACE allowing 2.5 hours of state-space exploration for
each inferred state. To provide a fair comparison, we
ran the baseline for an amount of time equal to
the sum of MACE’s inference and state-space exploration
times. As shown in Table 3, our results illustrate
that MACE provides a significant improvement in
instruction coverage over dynamic symbolic execution.


As mentioned before, our tool currently works on user- 
space programs only. Because Windows SMB is mostly 
implemented as a part of the Windows kernel, the results 
of the baseline approach were abysmal. To avoid a straw 
man comparison, we chose to compare against Samba’s 
gentest test suite, regularly used by Samba developers 
to test the SMB protocol. Using the test suite, we gen- 
erate test sequences and measure the obtained coverage. 
As for other experiments, we allocated the same amount 
of time to both the test suite and MACE. The experimen- 
tal results clearly show MACE’s ability to augment test 
suites manually written by developers. 


Number of Detected Crashes. Using the same setup 
as in the previous experiment, we measured the num- 
ber of crashing input sequences generated by each ap- 
proach. We report the number of crashes and the num- 
ber of unique crash locations. From each category of 
unique crash locations, we manually processed the first 
four reported crashes. All the found vulnerabilities (Ta- 
ble 2) were found by processing the very first crash in 
each category. All the later crashes we processed were 
just variants of the first reported crash. MACE found 30 
crashing input sequences with 9 of them having unique 
crash locations (the EIP of the crashed instruction). In 
comparison, the baseline approach only found 20 crash- 
ing input sequences, all of them having the same crash 
location. 
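
Grouping crashes by crash location and selecting the first few from each group for manual inspection can be sketched as follows. This is a minimal sketch. The `(eip, input_sequence)` pair representation and the function names are assumptions made for illustration.

```python
def group_by_crash_location(crashes):
    """Group (eip, input_sequence) pairs by crash address, keeping report order."""
    groups = {}
    for eip, inputs in crashes:
        groups.setdefault(eip, []).append(inputs)
    return groups

def triage_candidates(groups, per_location=4):
    """Keep the first few crashing inputs of each unique crash location."""
    return {eip: seqs[:per_location] for eip, seqs in groups.items()}
```

The number of keys in the grouped dictionary is the number of unique crashes, and the total number of stored sequences is the number of crashes.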
Figure 5: SMB Exploration Depth. The inferred state
machine can be seen as a directed graph. Suppose we 
compute a spanning tree (e.g., [13]) of that graph. The 
root of the graph is at level zero. Its children are at level 
one, and so on. The figure shows the percentage of states 
visited at each level by MACE and the baseline approach. 
The numbers above points show the number of visited 
states at the given depth. The shaded area clearly shows 
that MACE is superior to the baseline approach in reach- 
ing deep states of the inferred protocol. 


Exploration Depth. Using the same setup as for the 
coverage experiment, we measured how effective each 
approach is in reaching deep states. The inferred state 
machine can be seen as a directed graph. Suppose we 
compute a spanning tree (e.g., [13]) of that graph. The 
root of the graph is at level zero. Its children are at 
level one, and so on. We measured the percentage of 
states reached at every level. Figure 5 clearly shows that 
MACE is superior to the baseline approach in reaching 
deep states in the inferred protocol. 
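
The depth measurement can be sketched by assigning each state a level in a spanning tree and computing the fraction of states visited per level. This is a minimal sketch, assuming a breadth-first spanning tree (one possible choice of spanning tree) and the same hypothetical dictionary encoding of the state machine, `(state, input)` to `(next_state, output)`.

```python
from collections import Counter, deque

def state_levels(transitions, root):
    """Level of each reachable state in a breadth-first spanning tree."""
    adjacency = {}
    for (src, _symbol), (dst, _output) in transitions.items():
        adjacency.setdefault(src, []).append(dst)
    levels = {root: 0}
    queue = deque([root])
    while queue:
        state = queue.popleft()
        for nxt in adjacency.get(state, []):
            if nxt not in levels:
                levels[nxt] = levels[state] + 1
                queue.append(nxt)
    return levels

def visited_percent_per_level(levels, visited):
    """Percentage of states at each level that an approach visited."""
    totals = Counter(levels.values())
    hits = Counter(levels[s] for s in visited if s in levels)
    return {lvl: 100.0 * hits[lvl] / totals[lvl] for lvl in sorted(totals)}
```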


7 Limitations 


Completeness is a problem for any dynamic analysis 
technique. Accordingly, MACE cannot guarantee that 
all the protocol states will be discovered. Incomplete- 
ness stems from the following: (1) each state-space ex- 
plorer instance runs for a bounded amount of time and 
some inputs may simply not be discovered before the 
timeout, (2) among multiple shortest transfer sequences 
to the same abstract state, MACE picks one, potentially 
missing further exploration of alternative paths, (3) sim- 
ilarly, among multiple concrete input messages with the 
same abstract behavior, MACE picks one and considers 
the rest redundant (Definition 2). 

Our approach to model inference and refinement is 
not entirely automatic: the end users need to provide an 
abstraction function that abstracts concrete output mes- 
sages into an abstract alphabet. Coming up with a good 
output abstraction function can be a difficult task. If the
provided abstraction is too fine-grained, model inference 
may be too expensive to compute or may not even con- 
verge. On the other hand, the inferred model may fail to 
distinguish two interesting states if the abstraction is too 
coarse-grained. Nevertheless, our approach provides an 
important improvement over our prior work [11], which 
requires abstraction functions for both input and output 
messages. 

When using our approach to learn a model of a pro- 
prietary protocol, a certain level of protocol reverse- 
engineering is required prior to running MACE. First, we 
need a basic level of understanding of the protocol inter- 
face to be able to correctly replay input messages to the 
analyzed program. For example, this may require over- 
writing the cookie or session-id field of input messages 
so that the sequence appears indistinguishable from real 
inputs to the target program. Second, our approach re- 
quires an appropriate output abstraction, which in turn 
requires understanding of the output message formats. 
Message format reverse-engineering is an active area of 
research [14, 15, 6] out of the scope of this paper. 

Encryption is a difficult problem for every (existing) 
protocol inference technique. To circumvent the issue, 
we configure the analyzed programs not to use encryp- 
tion. However, for proprietary protocols, such a con- 
figuration may not be available and techniques [5, 29] 
that automatically reverse-engineer message encryption 
are required. 
9 Conclusions and Future Work


We have proposed MACE, a new approach to software 
state-space exploration. MACE iteratively infers and re- 
fines an abstract model of the protocol, as implemented 
by the program, and exploits the model to explore the 
program’s state-space more effectively. By applying 
MACE to four server applications, we show that MACE 
(1) improves coverage by up to 58.86%, (2) discovers significantly
more vulnerabilities (seven vs. one), and (3)
performs significantly deeper search than the baseline 
approach. 

We believe that further research is needed along sev- 
eral directions. First, a deeper analysis of the correspon- 
dence of the inferred finite state models to the structure 
and state-space of the analyzed application could reveal 
how models could be used even more effectively than 
what we propose in this paper. Second, it is an open 
question whether one could design effective automatic 
abstractions of the concrete input messages. The filter- 
ing function we propose in this paper is clearly effective, 
but might drop important messages. Third, the finite- 
state models might not be expressive enough for all types 
of applications. For example, subsequential transducers 
[28] might be the next, slightly more expressive, repre- 
sentation that would enable us to model protocols more 
precisely, without significantly increasing the inference 
cost. Fourth, MACE currently does no white box analy- 
sis, besides dynamic symbolic execution for discovering 
new concrete input messages. MACE could also monitor 
the value of program variables, consider them as the in- 
put and the output of the analyzed program, and automat- 
ically learn the high-level model of the program’s state- 
space. This extension would allow us to apply MACE to 
more general classes of programs. 
Abstract


Access control vulnerabilities, which cause privilege es- 
calations, are among the most dangerous vulnerabilities 
in web applications. Unfortunately, due to the difficulty 
in designing and implementing perfect access checks, 
web applications often fall victim to access control at- 
tacks. In contrast to traditional injection flaws, access 
control vulnerabilities are application-specific, rendering 
it challenging to obtain precise specifications for static 
and runtime enforcement. On one hand, writing specifi- 
cations manually is tedious and time-consuming, which 
leads to non-existent, incomplete or erroneous specifica- 
tions. On the other hand, automatic probabilistic-based 
specification inference is imprecise and computationally 
expensive in general. 

This paper describes the first static analysis that au- 
tomatically detects access control vulnerabilities in web 
applications. The core of the analysis is a technique that 
statically infers and enforces implicit access control as- 
sumptions. Our insight is that source code implicitly doc- 
uments intended accesses of each role and any successful 
forced browsing to a privileged page is likely a vulner- 
ability. Based on this observation, our static analysis 
constructs sitemaps for different roles in a web applica- 
tion, compares per-role sitemaps to find privileged pages, 
and checks whether forced browsing is successful for 
each privileged page. We implemented our analysis and 
evaluated our tool on several real-world web applications. 
The evaluation results show that our tool is scalable and 
detects both known and new access control vulnerabilities 
with few false positives. 


1 Introduction 


Web applications often restrict privileged accesses to au- 
thorized users. While bringing the convenience of ac- 
cessing a large amount of information and operations 
from anywhere into people’s daily lives, web applications 
have opened a new door for attacks and the number of 
web-based attacks is on the rise. A Symantec Internet 
security threat report published in April 2011 points out
that the volume of web-based attacks in 2010 increased
by 93% over the volume observed in 2009¹. Researchers
of web security have focused their attention on injection 
vulnerability, which is the most common vulnerability in 
web applications. Although not as prevalent as injection 
vulnerability, access control vulnerability poses a more se- 
rious threat because of exposed privileges, and has started 
attracting the attention of researchers [7]. Compared with 
those in traditional software, access checks in web ap- 
plications are harder to get right because of the stateless 
nature of the HTTP protocol. In traditional software, once 
a user has passed an authentication check, the system 
remembers the identity of the user until she logs out or 
a timeout event happens. This is not the case for web 
applications, which must parse each new HTTP request 
to identify a previously logged-in user. A statistics re- 
port published in 2007 states that 14.15% of the surveyed 
web applications suffer from vulnerabilities of insufficient 
authorization².

Traditional injection vulnerabilities such as Cross-Site 
Scripting (XSS) and SQL injection are not application- 
specific and have a clear and general definition [25]: an in- 
jection vulnerability exists when an untrusted input flows 
into a sensitive sink without proper sanitization. To detect 
injection vulnerabilities, it is sufficient to analyze indi- 
vidual pages separately to examine where untrusted user 
inputs can flow. In contrast, access control vulnerabilities 
are application-specific, and it is necessary to examine 
connections between pages. 

Web application developers frequently make implicit 
assumptions of allowed accesses and protect privileged 
pages by hiding links to these pages from unauthorized 
users. However, security by obscurity is insufficient to 
prevent a determined and skilled attacker from accessing 
these pages, viewing sensitive data or performing dan- 
gerous operations. As an example, Business Wire used a 


¹http://www.symantec.com/business/threatreport
²http://projects.webappsec.org/f/wasc_wass_2007.pdf
web server to store files of important trade information,
which were supposed to be accessible to registered mem- 
bers only. Although the URLs to these files were hidden 
in the presentation layer from unauthorized users, the 
date-based URLs were highly predictable. By simply accessing
these privileged files, the investment bank Lohmus
Haavel & Viisemann profited by over eight million dollars
based on the disclosed trade information³. Similarly, in
November 2010, Bloomberg News obtained and published
valuable financial earnings data of Disney and NetApp 
to its subscribers hours before official data releases by 
predicting resource locations inside secure corporate net- 
works. As yet another example, accesses to the videos 
of USENIX conference presentations are restricted to 
USENIX members for a short period after a conference. 
However, the authors of this paper were able to predict 
the author-name-based URLs of the videos and download 
a few videos as public users. 

Researchers have proposed various static and dynamic 
analysis techniques [1, 7, 10, 13] to detect violations of 
application logic, including access control attacks. Unfortunately,
these techniques have limited effectiveness in
detecting access control vulnerabilities. Dynamic analyses
have difficulty finding hidden pages and determining
intended accesses for each role. Furthermore, sitemaps 
covered by dynamic executions tend to be shallow and 
incomplete as user inputs are usually limited. Although
static analyses typically have better coverage, they often
require good specifications in order to generate useful
reports whose false positives do not overwhelm users.
In practice, deriving precise specifications is challenging, 
especially when diverse authentication and access control 
management schemes are in use. Because manually writing
specifications is time-consuming and probabilistic
inference is error-prone, it is desirable to infer
implicit assumptions on intended accesses precisely from the
source code of applications.

In this paper, we use role to represent a unique set 
of privileges that a group of users has. Most web ap- 
plications have at least three types of roles: the role for 
administrators, the role for normal logged-in users and 
the role for public or anonymous users. Access control 
checks must be performed before granting access to any 
privileged resource to prevent privilege escalation attacks. 
When implicit assumptions are not matched by explicit 
access checks, unauthorized accesses are possible. 

We propose the first role-based static analysis to detect 
access control vulnerabilities with automatic inference on 
implicit access control assumptions. Our key observations 
are that each role represents a unique set of privileges, and 
intended accesses for each role are reflected in explicit 
links shown in the presentation layer of an application. 
Guided by these observations, our analysis automatically 


³ http://www.whitehatsec.com/home/assets/WP_bizlogic092407.pdf

derives specifications on privileged accesses by comparing
explicit links presented to different roles. It then
directly accesses privileged pages for unprivileged roles, 
and examines whether these accesses are allowed to de- 
tect vulnerable pages which have missing or insufficient 
access checks. Our main contributions are: 


• A formal definition of access control vulnerabilities
in web applications.

• The first role-based static analysis which automatically
detects access control vulnerabilities in web
applications with minimal manual effort.

• An implementation of our analysis which constructs
intended per-role sitemaps. Given role-based specifications,
our prototype can systematically explore
feasible execution paths based on the satisfiability of
constraints.

• An evaluation of our tool on real-world web applications.
Our tool works on unmodified code, and
is able to detect both new and known vulnerabilities
before the deployment of web applications. The
evaluation results show that our approach is scalable
and effective, with few false positives.


The rest of the paper is organized as follows. We first 
use an example to illustrate the main steps of our ap- 
proach (Section 2) and then present our formalization 
of access control vulnerability in web applications (Sec- 
tion 3). Section 4 describes our detailed algorithms. Sec- 
tion 5 presents the implementation details of our static 
analyzer, and Section 6 shows the effectiveness, coverage 
and performance of our analyzer on real-world web appli- 
cations. Finally, we survey related work (Section 7) and 
conclude (Section 8). 


2 Illustrative Example 


Figure 1 shows a simple web application based on one 
of the real-world web applications in our test suite. For 
illustration, suppose that the application has two roles: 
role a for administrators and role b for normal users. In 
our approach, we require developers to only specify ap- 
plication entry points and role-based application states, 
which serve as the basis for automatically inferring the set 
of privileged pages. Suppose that in the given specifications,
the entry sets for both roles are identical and contain
only “index.php”, and the value of $_SESSION[“admin”]
is specified as true for role a but false for role b. As 
we can see from the source code, only “functions.php” 
checks accesses. This file is included via PHP inclu- 
sion in both “index.php” and “user_delete.php”, but not 
“user_add.php.” Consequently, access checks are missing 
in “user_add.php” but present in the other three pages.

index.php:

    <?php
    include("functions.php");
    $add = "user_add.php";
    $del = "user_delete.php";
    echo "<a href=" . $add . ">Add User</a>";
    echo "<a href=" . $del . ">Delete User</a>";
    ?>

user_delete.php (includes "functions.php"; the remaining statements are illegible in the source)

user_add.php:

    <?php
    add_user();
    ?>

functions.php:

    <?php
    session_start();
    if (!$_SESSION["admin"]) {
        die("Access denied!");
    }
    ?>

Figure 1: An Example of Access Control Vulnerability. Solid arrows represent explicit links, and dashed arrows
represent inclusion relationships between pages. Arrows correspond to edges in sitemaps and are labeled with
roles. The intended sitemap for privileged role a has four edges while the intended sitemap for role b has only
one edge.


The first step of our analysis constructs per-role 
sitemaps with a worklist-based algorithm. Initially, work- 
lists for both roles are [“index.php”]. While a worklist is 
not empty, our analysis pops a work node from the front 
of the worklist each time. Let us look at the sitemap 
construction for role a first. The first analyzed node 
is “index.php”. From this node, users of role a can ex- 
plicitly reach both “user_add.php” and “user_delete.php” 
via anchor tags, and “functions.php” via a file inclusion. 
Thus, our analysis adds three new edges in the sitemap 
and appends the newly discovered nodes to the worklist, 
which is now [“user_add.php”, “user_delete.php”, “functions.php”].
The second analyzed node is “user_add.php”.
This node cannot reach any nodes, and thus our analysis
pops “user_delete.php” and the worklist becomes
[“functions.php”]. Role a can reach “functions.php” from
“user_delete.php”, and thus our analysis adds a new edge
in the sitemap. Because “functions.php” is already in
the worklist, it is not appended to the current worklist.
Finally, our analysis pops “functions.php”. This node
cannot reach any nodes and our analysis stops because the
worklist is now empty. Now let us look at the sitemap
construction for role b. The first popped node is still 
“index.php”. However, role b can only explicitly reach 
“functions.php” via a file inclusion from this node. The 
links to “user_delete.php” and “user_add.php” are hidden 
from users of role b in “index.php” via the access check 
in “functions.php”. Therefore, our analysis adds only one 
new edge and stops because the worklist is now empty. 
The edges of constructed per-role sitemaps are shown in 
Figure 1. 
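
The worklist traversal described above can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the LINKS table is a hypothetical, hand-written summary of which links each page of Figure 1 exposes to each role, whereas the real analysis derives these links statically from the PHP source.

```python
from collections import deque

# Hypothetical per-role link table for the Figure 1 example (an assumption
# made for illustration; the paper's analysis infers it from source code).
LINKS = {
    "a": {
        "index.php": ["user_add.php", "user_delete.php", "functions.php"],
        "user_add.php": [],
        "user_delete.php": ["functions.php"],
        "functions.php": [],
    },
    "b": {
        "index.php": ["functions.php"],
        "functions.php": [],
    },
}

def build_sitemap(role, entries):
    """Breadth-first worklist traversal; returns (edges, reachable nodes)."""
    worklist = deque(entries)
    visited = set()
    edges = []
    while worklist:
        node = worklist.popleft()          # pop from the front of the worklist
        visited.add(node)
        for target in LINKS[role].get(node, []):
            edges.append((node, target))
            # Append only nodes that are neither visited nor already queued.
            if target not in visited and target not in worklist:
                worklist.append(target)
    return edges, visited

edges_a, nodes_a = build_sitemap("a", ["index.php"])
edges_b, nodes_b = build_sitemap("b", ["index.php"])
print(len(edges_a), len(edges_b))          # number of sitemap edges per role
print(sorted(nodes_a - nodes_b))           # pages reachable for role a only
```

The edge counts agree with the caption of Figure 1, and the set difference is the input to the second step of the analysis.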


The second step of our analysis infers the set of
privileged pages and attempts to access these pages directly
to detect access control vulnerabilities. Comparing
the sets of explicitly reachable nodes for role a
and role b, our analysis infers that “user_add.php” and 
“user_delete.php” are privileged pages intended for users 
of role a only. Consequently, these two pages should 
have access checks to ward off users of role b. Un- 
fortunately, only “user_delete.php” is safeguarded and 
“user_add.php” is left unprotected. Therefore, a direct ac- 
cess to “user_delete.php” fails, whereas a direct access to 
“user_add.php” succeeds, indicating that “user_delete.php” 
is guarded and “user_add.php” is vulnerable. 


3 Approach Formulation 


This section formulates our high-level approach. We de- 
fine the notions of role, explicit link, forced browsing, 
web application and access control vulnerability, and 
present two assumptions we make with regard to roles 
and intended accesses. 


Definition 1 (Role). A role $r \in R$ captures the set of
allowed accesses for all users of role $r$, where set $R$ denotes
the roles that a web application has. Each role $r$ represents a
distinctive set of privileges.


Assumption 1 We assume that roles in $R$ form a lattice
$(R, \sqsubseteq)$, where $\sqsubseteq$ denotes the ordering relationship
between any two roles. Under this assumption, accessing
a privileged resource as an unprivileged role is considered
a privilege escalation attack. Roles at the same level of
the lattice are not ordered by $\sqsubseteq$ as they may represent
different sets of allowed accesses. The role for administrators
is $\top$; the role for public users is $\bot$; and the role for
normal logged-in users lies in the middle of the lattice.


Definition 2 (Explicit Link). In a web application,
there exists an explicit link from page $n_i$ to a different
page $n_j$ when it is possible to jump to $n_j$ via an explicit
URL in $n_i$, incurring no exceptions or errors. URLs might
appear in file inclusions, header redirections, HTML tags
for anchors, forms, meta refresh headers, frames, iframes,
scripts, images or links.


Definition 3 (Forced Browsing). Forced browsing is 
the act of directly accessing privileged pages rather than 
following explicit links in a web application. Attackers of- 
ten harness brute force techniques to access hidden pages 
with predictable locations. We consider forced browsing 
successful when HTML pages presented to two differ- 
ent roles are identical, and no redirections, exceptions or 
errors occur during the page rendering process. 


Definition 4 (Web Application). Let node represent
a web page. Suppose that a web application contains
$k$ nodes. Given a user role $r \in R$, we abstract the web
application as $P_r = (S_r, Q_r, E_r, I_r, \Pi_r, N_r)$, where

• Entry set $S_r$ contains the entry nodes to the web
application. We include index pages in all directories
in the entry set. Different roles may have different
entry sets.

• State set $Q_r = \{q_i \mid 0 \le i < k\}$ is a set of application
states. For each node $n_i$, an application state $q_i$
captures critical information at that node. It might
include session values, cookie values, request parameter
values, database records, variable values or
function return values.

• Explicit edge set $E_r = \{(n_i, n_j) \mid 0 \le i, j < k\}$. An
explicit edge from node $n_i$ to $n_j$ exists iff $n_i$ in state
$q_i$ contains an explicit link to $n_j$.

• Implicit edge set $I_r = \{(n_i, n_j) \mid 0 \le i, j < k\}$. An
implicit edge from node $n_i$ to $n_j$ exists iff forced
browsing enables one to jump to $n_j$ from $n_i$ in state
$q_i$. Accesses via implicit edges are allowed but often
unintended.

• Navigation path set $\Pi_r = \{(n_i)_{0 \le i \le l} \mid 0 \le l < k \wedge
n_0 \in S_r \wedge (n_i, n_{i+1}) \in (E_r \cup I_r)\}$. It consists of all
possible navigation paths for role $r$, including explicit
edges as well as implicit edges.

• Explicitly reachable node set $N_r$ consists of nodes
that are reachable from application entries in $S_r$ via
explicit edges in $E_r$. It can be easily computed with
a graph reachability analysis.


Assumption 2 For each node in a web application, if
multiple roles can reach this node on navigation paths
composed of only explicit edges, we assume that the
privilege level required to access this node is determined
by the least privileged role.


Definition 5 (Access Control Vulnerability). Let
$a, b \in R$ denote two roles that can be ordered in a web
application where role $b$ is less privileged than role $a$, i.e.,
$b \sqsubset a$. An access control vulnerability exists at node $n$
when:

$$n \in N_a \wedge n \notin N_b \wedge \exists\, \pi_b \in \Pi_b \; (n \in \pi_b)$$

In this definition, destination node $n$ is a privileged
node intended to be accessible to role $a$ but not role $b$. We
use $n \in \pi_b$ to denote that $n$ is on navigation path $\pi_b$. This
node is vulnerable to access control attacks when a user
of role $b$ is able to access $n$ via an allowed, but probably
unintended, navigation path $\pi_b$.
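
Definition 5 can be read as a predicate over three edge sets. The sketch below is a hedged illustration under stated assumptions: the helper names (reachable, vulnerable_nodes) and the sample edge sets are hypothetical, and "n lies on some navigation path of role b" is approximated by graph reachability over explicit and implicit edges, which is equivalent for a finite graph.

```python
def reachable(entries, edges):
    """Nodes reachable from the entry set by following the given edges."""
    seen, stack = set(entries), list(entries)
    while stack:
        node = stack.pop()
        for src, dst in edges:
            if src == node and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen

def vulnerable_nodes(entries_a, entries_b, explicit_a, explicit_b, implicit_b):
    n_a = reachable(entries_a, explicit_a)                    # N_a
    n_b = reachable(entries_b, explicit_b)                    # N_b
    # Nodes on some navigation path of role b (explicit or implicit edges).
    on_some_path_b = reachable(entries_b, explicit_b | implicit_b)
    return {n for n in n_a - n_b if n in on_some_path_b}

# Hypothetical edge sets modeled on the example of Section 2.
explicit_a = {
    ("index.php", "user_add.php"),
    ("index.php", "user_delete.php"),
    ("index.php", "functions.php"),
    ("user_delete.php", "functions.php"),
}
explicit_b = {("index.php", "functions.php")}
implicit_b = {("index.php", "user_add.php")}   # forced browsing succeeds here

vulns = vulnerable_nodes({"index.php"}, {"index.php"},
                         explicit_a, explicit_b, implicit_b)
print(vulns)
```

Under these assumed edges, only “user_add.php” satisfies all three conjuncts; “user_delete.php” is privileged but lies on no navigation path of role b.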


4 Analysis Algorithm 


In this section, we introduce the three major algorithms 
of our approach. Section 4.1 describes how our analysis 
automatically infers specifications of implicit access con- 
trol assumptions and detects access control vulnerabilities 
from a high-level view. Section 4.2 shows the algorithm 
that we use to build per-role sitemaps. Finally, we present 
the detailed link extraction algorithm in Section 4.3. 


4.1 Vulnerability Detection 


Figure 2 presents the vulnerability detection algorithm 
which is the core of our approach. This algorithm infers 
privileged nodes from the source code of a web applica- 
tion and identifies nodes that are not properly protected. 


DETECTVULS(Spec_a, Spec_b, reg)
 1  Vuls ← ∅
 2  nfa ← REG2NFA(reg)
 3  dfa ← NFA2DFA(nfa)
 4  N_a ← BUILDSITEMAP(Spec_a, dfa)
 5  N_b ← BUILDSITEMAP(Spec_b, dfa)
 6  Privileged ← N_a \ N_b
 7  for each n in Privileged
 8      do (cfg_a, R_a) ← GETCFG(n, Spec_a)
 9         (cfg_b, R_b) ← GETCFG(n, Spec_b)
10         if SIZEOF(cfg_a) = SIZEOF(cfg_b) and R_a = R_b
11            then Vuls ← Vuls ∪ {n}
12  return Vuls





Figure 2: Algorithm for Vulnerability Detection. 


Let $Spec_a$ and $Spec_b$ denote specifications for role a and
role b respectively. Initially, the set of vulnerable nodes
Vuls is empty. First, this algorithm parses the regular
expression reg, which captures HTML tags where a link
might appear, into a non-deterministic finite automaton
(NFA). Then, the algorithm transforms the NFA into a
deterministic finite automaton (DFA). Either NFA or DFA
could be used for extracting links, and we chose DFA for
its advantage in performance and the ease of FA state
management.

Throughout this paper, we assume role a is more privileged
than role b. Following Definition 4, we use $N_a$
and $N_b$ to denote the sets of explicitly reachable nodes
for roles a and b respectively. Function BUILDSITEMAP,
whose details are shown later in Section 4.2, computes
these two sets. Relying on Assumption 2, the algorithm
infers privileged nodes that are present in $N_a$ but not in $N_b$
(Line 6). For the example in Section 2, $N_a$ = {“index.php”,
“user_add.php”, “user_delete.php”, “functions.php”} and
$N_b$ = {“index.php”, “functions.php”}.

Access checks at privileged locations may be missing
or insufficient. This algorithm analyzes each privileged
node $n$ twice with function GETCFG, once for role a to
create an oracle for the intended server response (Line 8),
and once for role b to emulate forced browsing (Line 9).
Given a role $r$ and a privileged node $n$, GETCFG returns
a context-free grammar (CFG) $cfg_r$ and the set of page
redirections $R_r$.⁴ The obtained $cfg_r$ is an approximation
of the dynamic HTML output of node $n$. We observe that
when an access check succeeds, users are often granted
access to sensitive information or operations; otherwise,
they are redirected to another page, or presented with error
messages or login forms. In the latter case, the CFG sizes of
the two roles are different because of the different HTML
outputs that are presented. Consequently, if the sizes of
the two CFGs or the two redirection sets differ, node $n$ is
considered guarded; otherwise, $n$ may be vulnerable (Line
11). For the privileged page “user_delete.php” shown in
Figure 1, SIZEOF($cfg_a$) $\neq$ SIZEOF($cfg_b$) and $R_a = R_b =
\emptyset$, indicating that the page is guarded; for the privileged
page “user_add.php”, SIZEOF($cfg_a$) = SIZEOF($cfg_b$) and
$R_a = R_b = \emptyset$, indicating that the page is vulnerable.
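
The comparison in Lines 7 to 11 of Figure 2 can be mimicked with a toy model. In this sketch, which is an assumption-laden illustration rather than the paper's analyzer, the hypothetical function render stands in for GETCFG and returns one concrete output string plus a redirection set, and the length of that string stands in for SIZEOF of a CFG.

```python
def render(node, is_admin):
    """Toy stand-in for GETCFG: returns (output, redirections) for a page."""
    guarded = {"user_delete.php"}           # pages that run the access check
    if node in guarded and not is_admin:
        return "Access denied!", frozenset()
    return f"<p>{node} done</p>", frozenset()

def detect_vuls(privileged):
    """Flag privileged pages whose response is the same for both roles."""
    vuls = set()
    for n in sorted(privileged):
        out_a, red_a = render(n, is_admin=True)    # oracle for role a
        out_b, red_b = render(n, is_admin=False)   # forced browsing as role b
        if len(out_a) == len(out_b) and red_a == red_b:
            vuls.add(n)
    return vuls

found = detect_vuls({"user_add.php", "user_delete.php"})
print(found)
```

A size comparison is a coarse heuristic: two different outputs of equal size would be reported as identical, which is why the text says such a node "may be" vulnerable.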


4.2 Building Sitemaps 


Function BUILDSITEMAP shown in Figure 3 builds a per-role
sitemap with specifications $Spec_r$ for role $r$ and the
DFA dfa. We use a worklist-based algorithm to traverse
nodes in a web application in a breadth-first manner. Initially,
both the visited node set Visited and the edge set $E_r$
are empty, and the worklist WkLst is initialized with the
entry set $S_r$ specified in $Spec_r$ (Line 3).

In each iteration of the loop, function GETWORKNODE
pops a working node $n_i$ from the front of list WkLst
and retrieves its associated state $q_i$ from $Spec_r$ (Line 5)
to find outgoing edges of this working node. Next, this
algorithm constructs a CFG that represents the possible
HTML outputs of node $n_i$ (Line 6). Besides $cfg_i$, function


⁴ Throughout this paper, CFG stands for context-free grammar rather
than control-flow graph.

BUILDSITEMAP(Spec_r, dfa)
 1  E_r ← ∅
 2  Visited ← ∅
 3  WkLst ← GETENTRIES(Spec_r)
 4  while WkLst ≠ ∅
 5      do (n_i, q_i) ← GETWORKNODE(WkLst, Spec_r)
 6         (cfg_i, R_i, F_i) ← CONSTRUCTCFG(n_i, q_i)
 7         L_i ← EXTRACTLINKS(cfg_i, dfa)
 8         N_i ← L_i ∪ R_i ∪ F_i
 9         for each n_j in N_i
10             do E_r ← E_r ∪ {(n_i, n_j)}
11         Visited ← Visited ∪ {n_i}
12         N ← ACTIVE(N_i) \ (Visited ∪ WkLst)
13         WkLst ← APPEND(WkLst, N)
14  return GETNODES(E_r)





Figure 3: Algorithm for Building Sitemaps. 


CONSTRUCTCFG also returns the page redirection set $R_i$
and the file inclusion set $F_i$ as links in these two sets also
contribute to outgoing edges in a sitemap. Then, function
EXTRACTLINKS extracts a set of matched links $L_i$ that
are present in $cfg_i$ based on dfa (Line 7). The details of
EXTRACTLINKS are presented later in Section 4.3. The
set of reachable nodes $N_i$ for $n_i$ is the union of $L_i$, $R_i$ and
$F_i$ (Line 8). We conservatively include $F_i$ in this union because
included files may present sensitive information or
operations. The algorithm adds an outgoing edge $(n_i, n_j)$
to the explicit edge set $E_r$ for each node $n_j \in N_i$ (Line
10) and then adds $n_i$ to the visited node set (Line 11). To
determine which nodes to analyze, we partition nodes into
active nodes and inactive nodes, and only analyze active
ones. Active nodes may have outgoing edges in a sitemap,
whereas inactive nodes are dead ends. For example, a
PDF file is considered an inactive node, while a PHP page
is considered an active node. Finally, the algorithm adds
the newly discovered active nodes to the worklist, excluding
the ones that have been visited or are already in the
worklist (Lines 12 and 13). The loop terminates when WkLst
becomes empty, indicating that the construction of a per-role
sitemap is complete. At this point, function BUILDSITEMAP
returns the set of explicitly reachable nodes $N_r$
based on $E_r$ (Line 14). When work node $n_i$ = “index.php”
shown in Figure 1 is analyzed for role a in a loop iteration,
$L_i$ = {“user_delete.php”, “user_add.php”}, $R_i = \emptyset$
and $F_i$ = {“functions.php”}. Therefore, three new outgoing
edges from “index.php” are added to $E_a$. In contrast,
when “index.php” is analyzed for role b, $L_i = R_i = \emptyset$ and
$F_i$ = {“functions.php”}. In this case, only one new edge
is added to $E_b$.


4.3 Link Extraction


We use $C$ to denote a CFG, and $F$ to denote an FA. In our
setting, a CFG represents the dynamic HTML output of a
node and an FA matches a single link-introducing HTML
tag of various forms. Let $\mathcal{L}(C)$ be the set of words in the
language for the CFG and $\mathcal{L}(F)$ be the set of words in
the language for the FA. Suppose that function SUBSTR
returns true only when $w'$ is a substring of $w$. The output
of EXTRACTLINKS on $C$ and $F$ is defined as follows:

$$\mathrm{EXTRACTLINKS}(C, F) = \{\, w' \mid w \in \mathcal{L}(C) \wedge w' \in \mathcal{L}(F) \wedge \mathrm{SUBSTR}(w', w) \,\}$$


We could use a straightforward three-step approach
to extract links. In the first step, we could use the stan- 
dard CFG-reachability algorithm [20] to compute a CFG 
representing the intersection of the two languages for C 
and F’, where F’ matches HTML outputs that contain 
at least one link-introducing tag. The subtle difference 
between F’ and F is that F’ matches link-introducing tags 
as well as link-irrelevant HTML outputs, while F only 
matches link-introducing tags. In the second step, we 
could generate all possible HTML outputs of the CFG. In 
the third step, we could use an HTML parser to extract 
links from the generated HTML outputs. Nevertheless, 
this approach is not ideal for two reasons. The first is
that the language of a CFG can be infinite and we can only
generate a finite set of possible HTML outputs. The second
is that the generated HTML outputs are likely to be
highly similar, and thus we may repetitively parse similar
HTML outputs. For better performance, we designed a
new algorithm that does not generate intermediate HTML 
outputs, but directly extracts links from the CFG. 
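
For intuition, the enumerate-then-parse approach that the paragraph above rejects can be sketched as follows. The sketch is hypothetical: the list outputs is a hand-written finite stand-in for $\mathcal{L}(C)$, and a Python regular expression for one quoted anchor tag stands in for the FA $F$.

```python
import re

# Hypothetical finite stand-in for L(C): every HTML output one page may emit.
outputs = [
    '<h1>Users</h1>',
    '<h1>Users</h1><a href="user_add.php">Add User</a>',
    '<h1>Users</h1><a href="user_add.php">Add User</a>'
    '<a href="user_delete.php">Delete User</a>',
]

# Stand-in for the FA F: a single anchor tag with a quoted href attribute.
LINK_TAG = re.compile(r'<a\s+href="[^"]*"\s*>')

def extract_links_naive(words):
    """Collect every substring w' of some output w that matches the tag pattern."""
    links = set()
    for w in words:
        links.update(LINK_TAG.findall(w))
    return links

links = extract_links_naive(outputs)
print(sorted(links))
```

Even in this tiny case the shared prefix is scanned three times, and a grammar with a loop would make the list of outputs unbounded, which are the two drawbacks named in the text.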

In a CFG $(V, \Sigma, P, S_0)$, $V$ is a finite set of variables (i.e.,
non-terminals); $\Sigma$ is a finite set of terminals which is the
alphabet of the language; $P = \{v \to rhs \mid v \in V \wedge rhs \in
(V \cup \Sigma)^*\}$ is a finite set of grammar productions; and $S_0$ is
the start variable. In an FA $(Q, \Sigma', q_0, \delta, Q_F)$, $Q$ is a finite,
non-empty set of states; $\Sigma'$ is the input alphabet; $q_0 \in Q$
is the start state; $\delta : Q \times \Sigma' \to Q$ is the state-transition
relation; and $Q_F \subseteq Q$ is the set of final states.

Figure 4 shows our link extraction algorithm where
function EXTRACTLINKS is the entry point. We use set
VQW to store $(v, q, w)$ tuples where $v$ represents a CFG
variable, $q$ is an FA state and $w$ is a partially matched link
string. Completely matched links are stored in set Words.
To begin with, this algorithm walks the CFG with the start
CFG symbol $S_0$, the start FA state $q_0$, and the empty string
which represents the terminals that have been partially
matched (Line 38).

Function WALKTERMINAL is the only function that
advances an FA state $q$ to a new state $q'$ based on the FA
transition function $\delta$ and an input character $t$ (Line 1). If

WALKTERMINAL(t, q, w)
 1  q' ← δ(q, t)
 2  if q' = q_0
 3     then return (q_0, "")
 4  w' ← APPEND(w, t)
 5  if q' ∈ Q_F
 6     then Words ← Words ∪ {w'}
 7          q' ← q_0
 8          w' ← ""
 9  return (q', w')

WALKVAR(v, q, w)
10  VQW ← VQW ∪ {(v, q, w)}
11  RHS ← PRODUCTIONS(v, P)
12  if ISSIGMA(RHS) or RHS = ∅
13     then return {(q, w)}
14  QW ← ∅
15  for each rhs in RHS
16      do if ISEPSILON(rhs)
17            then QW ← QW ∪ {(q, w)}
18            else QW ← QW ∪ WALKSYMBOLS(rhs, q, w)
19  return QW

WALKSYMBOL(s, QW)
21  Result ← ∅
22  for each (q, w) in QW
23      do if ISTERMINAL(s)
24            then QW' ← {WALKTERMINAL(s, q, w)}
25            else if (s, q, w) ∈ VQW
26                    then QW' ← {(q, w)}
27                    else QW' ← WALKVAR(s, q, w)
28         Result ← Result ∪ QW'
29  return Result

WALKSYMBOLS(rhs = [γ], q, w)
31  QW ← {(q, w)}
32  for each s_i in [γ]
33      do QW ← WALKSYMBOL(s_i, QW)
34  return QW

EXTRACTLINKS(cfg = (V, Σ, P, S_0), fa = (Q, Σ', q_0, δ, Q_F))
36  VQW ← ∅
37  Words ← ∅
38  WALKVAR(S_0, q_0, "")
39  return VALID(Words)


Figure 4: Algorithm for Link Extraction.

Figure 5: System Architecture.


$q'$ is the FA start state $q_0$, which indicates a mismatch,
the algorithm clears the partially matched terminals and
returns (Line 3); otherwise, it appends $t$ to $w$ (Line 4) and
examines $q'$ again (Line 5). If $q'$ is a final FA state in $Q_F$,
the algorithm adds the completely matched link to Words
(Line 6) and resets $w'$ to the empty string. In this way, we
filter out noise that is irrelevant to links in the CFG and
only keep track of link-introducing HTML outputs.


Recursive function WALKVAR walks the grammar productions
of variable $v$ under an FA state $q$ and a partially
matched word $w$. Function PRODUCTIONS retrieves the
set of productions which have $v$ as the left-hand-side variable
from the CFG production set $P$, and returns the set of
right-hand sides RHS (Line 11). The different elements
in RHS indicate how the dynamic HTML output might
diverge for $v$. Function ISSIGMA checks whether a set is
equivalent to the CFG alphabet $\Sigma$. A link of value $\Sigma^*$ can
point to any file in the application and therefore should
be discarded. If RHS forms the alphabet or the empty
set, the function returns the pair of unchanged $q$ and $w$
in a set (Line 13); otherwise, it walks the elements in set
RHS one by one. In each loop iteration, if a right-hand
side rhs has no symbols, the HTML output remains the
same (Line 17); otherwise, the algorithm searches the set
of new possible outcomes QW' with a call to function
WALKSYMBOLS (Line 18).


Recursive function WALKSYMBOLS walks the symbols
in list $[\gamma]$ in order. Consequently, links in the CFG
are matched in the order of their appearances in a possible
HTML output. Here $[\gamma] = (s_i)^* \wedge s_i \in (V \cup \Sigma)$, representing
a sequence of right-hand-side symbols. For each
symbol $s_i$ in the list, the algorithm transitions the set of
possible outcomes to a new set (Line 33).


Recursive function WALKSYMBOL walks a right-hand-side
symbol $s$ under each possible outcome $(q, w)$. In each
loop iteration, the algorithm first examines the symbol
$s$ (Line 23). If $s$ is a terminal, the FA state is deterministically
advanced via function WALKTERMINAL (Line
24). Otherwise, if the symbol is a variable, this algorithm
recursively calls function WALKVAR for $s$ (Line 27) when
the variable is associated with a new $q$ or a new $w$. The use of set
VQW ensures the termination of the algorithm. This algorithm
stops when all reachable grammar productions
have been explored at least once. A concrete example of
how this algorithm works is given in Section 5.2.2.
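
The four mutually recursive functions of Figure 4 can be transcribed into Python roughly as follows. This is a simplified sketch under several assumptions: the FA is a hypothetical hand-written automaton that accepts only unquoted tags of the form <a href=...>, every mismatch falls back to the start state, the FA returns to the start state after a complete match, and the ISSIGMA and VALID steps are omitted. The class Var and the sample grammar are illustrative names, not part of the original tool.

```python
class Var(str):
    """Marks a grammar variable; every other symbol is a one-character terminal."""

PREFIX = "<a href="                      # toy FA accepts: <a href= ... >
Q0, Q_URL, Q_FINAL = 0, len(PREFIX), len(PREFIX) + 1

def delta(q, t):
    """Toy transition function; any mismatch falls back to the start state."""
    if q < Q_URL:
        return q + 1 if t == PREFIX[q] else Q0
    if q == Q_URL:
        return Q_FINAL if t == ">" else Q_URL
    return Q0

def extract_links(productions, start):
    vqw, words = set(), set()

    def walk_terminal(t, q, w):
        q2 = delta(q, t)
        if q2 == Q0:                     # mismatch: drop the partial match
            return (Q0, "")
        w2 = w + t
        if q2 == Q_FINAL:                # complete match: record it and reset
            words.add(w2)
            return (Q0, "")
        return (q2, w2)

    def walk_var(v, q, w):
        vqw.add((v, q, w))
        alternatives = productions.get(v, [])
        if not alternatives:
            return {(q, w)}
        outcomes = set()
        for rhs in alternatives:         # an empty rhs leaves (q, w) unchanged
            outcomes |= walk_symbols(rhs, q, w) if rhs else {(q, w)}
        return outcomes

    def walk_symbol(s, outcomes):
        result = set()
        for q, w in outcomes:
            if not isinstance(s, Var):
                result.add(walk_terminal(s, q, w))
            elif (s, q, w) in vqw:       # seen before: do not recurse again
                result.add((q, w))
            else:
                result |= walk_var(s, q, w)
        return result

    def walk_symbols(rhs, q, w):
        outcomes = {(q, w)}
        for s in rhs:
            outcomes = walk_symbol(s, outcomes)
        return outcomes

    walk_var(start, Q0, "")
    return words

# Hypothetical grammar: a heading followed by any number of links.
S, LIST = Var("S"), Var("LIST")
grammar = {
    S: [list("<p>Users</p>") + [LIST]],
    LIST: [[],
           list("<a href=user_add.php>Add</a>") + [LIST],
           list("<a href=user_delete.php>Delete</a>") + [LIST]],
}
links = extract_links(grammar, S)
print(sorted(links))
```

The production for LIST is recursive, so its language is infinite, yet the walk terminates because the tuple (LIST, start state, empty word) is already recorded when the recursion comes back to it.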


5 Implementation 


As PHP is one of the most popular programming lan- 
guages for web applications, we implemented our ap- 
proach by extending Wassermann and Minamide’s PHP 
string analyzer [21, 30], which is written in OCaml. The 
original PHP string analyzer was developed to detect in- 
jection vulnerabilities in web applications, and it analyzes 
individual pages in isolation and explores all execution 
paths. To detect access control vulnerabilities, we mod- 
ified the string analyzer to build per-role sitemaps and 
examine connections between different pages. In par- 
ticular, we introduced the concept of role into the static 
analyzer, added new specification rules for application 
states and entry sets, and strategically explored paths 
based on branch feasibilities. To explore only feasible 
execution paths, we keep track of both arithmetic con- 
straints and string constraints. For arithmetic constraints, 
the analyzer consults a Satisfiability Modulo Theories 
(SMT) solver Z3 [8]; for string constraints, it consults a 
custom-built string constraint solver. Furthermore, we de- 
signed and implemented the algorithm shown in Figure 4 
to efficiently extract explicit links from CFGs, added sup- 
port for 176 built-in PHP functions, and modified both the 
specification lexer and parser to support specifications for 
the values of integers, floating-point numbers and strings. 

Figure 5 shows our system architecture. A web appli- 
cation can have multiple roles, and our analysis compares 
a pair of ordered roles each time. Initially, the DFA con- 
structor transforms the given regular expression reg into 
a DFA. The detection of access control vulnerabilities is 
carried out in two major steps. First, the sitemap builder 
explores the given web application based on parsed speci- 
fications and the DFA. Second, the reachable nodes com- 
parator infers what privileged nodes are, and the vulnera- 
bility detector performs forced browsing to detect nodes 
that are vulnerable to access control attacks. 


5.1 Specification Rules 


In our analysis, specifications are parsed with a lexer and 
a parser. For each role r, we only require developers to 
specify the entry set $S_r$ and the set of critical application
states $Q_r$. Multiple roles can share the same set of entry
points. Either index pages or active pages with no
incoming edges can be entry nodes. Index pages often
have conventional names such as “index.php” and “index.html”,
and can be easily identified with a file scan;
active pages with no incoming edges can be specified as 
entry nodes by developers. The types of application states 
that we support are listed in Definition 4. The state values 
that can be specified include abstract types and concrete 
values of built-in PHP types, and string values that can 
be represented by a regular expression. For function in- 
vocations, we allow developers to pinpoint an invocation 
by specifying the filename and line number where the in- 
vocation occurs. This is especially useful when function 
invocations return different values at different call sites. 
Optionally, developers can explicitly specify a set of 
privileged nodes. In contrast to implicit navigation paths 
which involve forced browsing, explicit navigation paths 
are often tested more thoroughly. However, it is still pos- 
sible that an allowed access to a sensitive node via an 
explicit navigation path of an unprivileged role is unau- 
thorized, violating Assumption 2. In this case, when an 
unprivileged user can explicitly navigate to a privileged 
node, we would have false negatives. To solve this prob- 
lem, we allow developers to explicitly specify privileged 
nodes. Such a node may be vulnerable to access control 
attacks even if it is explicitly accessible for both roles. 


5.2 Sitemap Builder 


The sitemap builder has two components: the context-free 
grammar constructor and the link extractor. With these 
two components, our analysis constructs a CFG for each 
explicitly reachable node, and extracts links embedded in 
the CFG to find outgoing edges of the node. 


5.2.1 Context-Free Grammar Constructor 


For each web page, our analyzer first parses the page 
into an Abstract Syntax Tree (AST), and then transforms 
the AST into an Intermediate Representation (IR), dis- 
tinguishing every variable occurrence. Interested readers 
can refer to Wassermann’s work [30] for more details. 
To build a per-role CFG, our analyzer explores the IR 
only when necessary by predicting branch feasibilities 
with an inter-procedural path-sensitive analysis. It ana- 
lyzes statements in the IR in a top-down manner, updating 
path conditions for both string constraints and arithmetic 
constraints. For arithmetic constraints, our analyzer re- 
sorts to the integrated Z3 to check the satisfiability of 
constraints; for string constraints, it feeds possible values 
of string variables and their aliases to our string constraint 
solver in exchange for answers. Our prototype string constraint
solver supports string constraints which may contain
multiple variables, regular expressions, equality and
inequality operators, and checks on string lengths. We
tried to solve string constraints with HAMPI [15], but

    function checkUser() {
        if (!isset($_SESSION["validUser"])
            || $_SESSION["validUser"] != true) {
            header("Location: login.php");
        }
    }

    checkUser();
    sensitiveOperation();





Figure 6: An Example of Path Exploration. 


it does not support multiple string variables yet. When
the constraints of a conditional are unsolvable, the analyzer
explores both branches, updating path conditions for both 
the true branch and the false branch. For each function 
call, our analyzer first checks its calling context and then 
explores the function only when the context is new. Next, 
it propagates constraints on the arguments and related 
global variables of the function call. The IR exploration 
terminates when all possible branches have redirections or 
exits, indicating that none of the unexplored branches are 
feasible. In our implementation, we do not consider different
contexts of page accesses and assume the parameters
of HTTP requests to be $\Sigma^*$ unless specified. In this way,
we analyze each page only once, making our analyzer 
scalable at the expense of obtaining over-approximations 
of outgoing edges. 

Finding the targets of PHP includes is a non-trivial task. 
It requires value resolution of possible string variables 
that are used for filename construction. Furthermore, it 
is necessary to find the directories that a PHP include file 
may reside in. When resolving PHP include paths, the 
following steps are performed in order: 


• The include_path in the configuration of a PHP
application is checked first;

• If no matching file is found under include_path,
the directory of the calling script is checked;

• If no matching file is found in the directory of
the calling script, the current working directory is
checked;

• If no matching file is found in the current working
directory, the inclusion finally fails.
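
The lookup order in the list above can be sketched as a small function. This is an illustration of the order as stated in this section only; the function name resolve_include and its parameters are hypothetical, and the sketch ignores other details of PHP's include semantics such as absolute or dot-relative paths.

```python
import os
import tempfile

def resolve_include(name, include_path, script_dir, cwd):
    """Return the first existing candidate path, or None if the inclusion fails."""
    # Order from the text: include_path entries, calling script dir, then cwd.
    for directory in list(include_path) + [script_dir, cwd]:
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate):
            return candidate
    return None

# Small demonstration on a throwaway directory tree.
with tempfile.TemporaryDirectory() as root:
    inc, script, cwd = (os.path.join(root, d) for d in ("inc", "script", "cwd"))
    for d in (inc, script, cwd):
        os.mkdir(d)
    for d in (script, cwd):                 # same file name in two places
        open(os.path.join(d, "functions.php"), "w").close()
    found = resolve_include("functions.php", [inc], script, cwd)
    missing = resolve_include("absent.php", [inc], script, cwd)
    found_dir = os.path.basename(os.path.dirname(found))

print(found_dir, missing)
```

Because the file is absent from the include path but present in both remaining locations, the copy next to the calling script wins.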


We illustrate our basic exploration strategy with a simple
example shown in Figure 6 based on one of the web applications
that we have analyzed. Function checkUser
checks whether an access should be allowed for a given
user. Function sensitiveOperation will only be
executed when the user has passed the access check. Suppose
that $_SESSION[“validUser”] is a critical application
state which determines the privileges of a role, and

Figure 7: A Deterministic Finite Automaton Example.

its value should be specified as true for role a and false 
for role b. Our analyzer explores the statements of the 
IR in order. Besides function definitions, the first statement
it encounters is the function call checkUser().
Therefore, it retrieves the corresponding function body 
and continues from the first statement in the function. Be- 
cause the first statement is an if statement, the analyzer 
attempts to solve the satisfiability of constraints to deter- 
mine branch feasibilities. If the given role is b, only the 
true branch is feasible. As the true branch has a header 
redirection, the analyzer stops exploring the statements 
after this function call. Otherwise, when the role is a, only 
the false branch is feasible, and the analyzer continues exploring
the statements after this function call, and eventually
reaches the function call sensitiveOperation().


Path sensitivity prevents us from exploring infeasible 
paths. For example, suppose we have predicate $x > 1 in 
the current path condition when the exploration reaches 
an if statement, the branch target of which depends on a 
conditional $x < 0. To determine the feasibilities of the 
two possible branches, our analyzer sends two queries 
to Z3. The first query appends the new constraint to the 
existing path condition, while the second query appends 
the negation of the new constraint to the existing path 
condition. Z3 will conclude that ($x > 1 ∧ $x < 0) is unsatisfiable,
but ($x > 1 ∧ ¬($x < 0)) is satisfiable. Thus,
only the false branch is feasible and our analyzer will not 
explore the infeasible true branch of the if statement. 
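
The two-query check can be illustrated with a small stand-in for the solver. The sketch below does not use Z3; it assumes constraints over a single integer variable, written as hypothetical (operator, bound) pairs, and decides satisfiability by intersecting bounds. All function names are illustrative.

```python
NEGATION = {">": "<=", "<": ">=", ">=": "<", "<=": ">"}


def negate(constraint):
    """Return the negation of a single (operator, bound) constraint."""
    op, bound = constraint
    return (NEGATION[op], bound)


def satisfiable(constraints):
    """Check whether some integer satisfies every (operator, bound) pair."""
    low, high = float("-inf"), float("inf")
    for op, bound in constraints:
        if op == ">":
            low = max(low, bound + 1)
        elif op == ">=":
            low = max(low, bound)
        elif op == "<":
            high = min(high, bound - 1)
        elif op == "<=":
            high = min(high, bound)
        else:
            raise ValueError(f"unsupported operator: {op}")
    return low <= high


def feasible_branches(path_condition, conditional):
    """Issue one query per branch, as described for the analyzer."""
    true_ok = satisfiable(path_condition + [conditional])
    false_ok = satisfiable(path_condition + [negate(conditional)])
    return true_ok, false_ok


if __name__ == "__main__":
    # Path condition $x > 1, conditional $x < 0: only the false branch is feasible.
    print(feasible_branches([(">", 1)], ("<", 0)))
```

When both queries come back feasible, or the constraints cannot be solved, both branches would be explored, which matches the conservative behavior described earlier in this section.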


5.2.2 Link Extractor


Our link extractor extracts links to different web pages 
within a given web application. Since we are interested in 
constructing sitemaps, our link extractor filters links that 
point to pages outside of the application. We did not reuse 
the implementation from the previous work [30], which 
is based on the standard graph-reachability algorithm, but 
instead implemented the new link extraction algorithm 
shown in Figure 4 to eliminate the need to compute
intermediate HTML outputs. As an example, Figure 7 
shows an FA which matches anchor, form, frame and
iframe tags in HTML outputs based on a simple regular 
expression: 





/<([Aa]|[Ff][Oo][Rr][Mm]|[Ii]?[Ff][Rr][Aa][Mm][Ee])\s[^>]*>/


We only show state-advancing edges in Figure 7 and omit
state-resetting edges. In this FA, the start state q_0 = 1
and the final state set Q_f = {8}. For any FA state, a
state-resetting edge directs the current FA state back to 
the start FA state on input characters other than the ones 
shown on the state-advancing edges. We use the following 
simplified PHP code taken from one of our test subjects 
to show how our link extractor works. 


echo "<div><a href="
   . $lang
   . ".php>Anchor</a></div>";


The above PHP code dynamically generates a link depending
on the value of variable $lang, which has three
possible candidates: “english”, “spanish” and “french”. 
For this code, a CFG with five variables and seven gram- 
mar productions will be generated: 


S0 → S1 S2
S1 → “<div><a href=”
S2 → S3 S4
S3 → “english” | “spanish” | “french”
S4 → “.php>Anchor</a></div>”


In this CFG, V = {S0, S1, S2, S3, S4} and S0 is the start
symbol. Note that S3 has three associated grammar productions
separated by bars. For the algorithm in Figure 4,
the link extraction starts with function call
WALKVAR(S0, 1, “”) (Line 38). Since S0 maps to only one
production, RHS = {[S1 S2]} (Line 11) and our algorithm
issues WALKSYMBOLS([S1 S2], 1, “”) (Line 18). Then, it
examines the symbols in list [S1 S2] (Line 32) in order to
derive the set of possible outcomes QW, the initial value
of which is {(1, “”)} (Line 31). Our algorithm sees that
the first symbol S1 is a variable and thus issues
WALKVAR(S1, 1, “”) (Line 27). For S1, RHS = {“<div><a
href=”} (Line 11), and the algorithm issues
WALKSYMBOLS(“<div><a href=”, 1, “”) (Line 18). Now our
algorithm examines these terminals in order with function
WALKTERMINAL. The first character is ‘<’, thus the
algorithm transits the FA state from 1 to 2 along a
state-advancing edge in Figure 7, and appends ‘<’ to w which
is now “<”. The second character is ‘d’, thus the algorithm
resets the FA state to the start state 1, and clears the
matched terminals in w. The third character is ‘i’, thus
the algorithm stays at the FA start state 1, and w is still
the empty string. Our algorithm continues like this and
by the time it gets to variable S3, the FA is in state 7 with
w = “<a href=”. For S3, RHS = {“english”, “spanish”,
“french”} (Line 11), and our algorithm walks these three
elements one by one (Line 15). There are three possible
outcomes, and thus the return value QW of
WALKSYMBOLS(S3, 7, “<a href=”) is {(7, “<a href=english”), (7,
“<a href=spanish”), (7, “<a href=french”)} (Line 19).
Our algorithm continues until all the seven grammar
rules have been explored. Upon termination, it returns
{“english.php”, “spanish.php”, “french.php”} (Line 39).
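
The end result of this walk can be reproduced with a much simpler, less efficient stand-in. The sketch below is not the algorithm of Figure 4: it enumerates every string of the (non-recursive) example grammar and then applies the tag regular expression, which is exactly the intermediate-output computation the FA-based walk avoids. The grammar encoding, the function names, and the attribute pattern TARGET are assumptions made for illustration.

```python
import re
from itertools import product

# Example grammar: variable -> list of productions; each production is a
# list of symbols. Symbols that are not keys are treated as terminal strings.
GRAMMAR = {
    "S0": [["S1", "S2"]],
    "S1": [["<div><a href="]],
    "S2": [["S3", "S4"]],
    "S3": [["english"], ["spanish"], ["french"]],
    "S4": [[".php>Anchor</a></div>"]],
}

# Tag pattern from the text, plus an assumed pattern for the link attribute.
TAG = re.compile(r"<([Aa]|[Ff][Oo][Rr][Mm]|[Ii]?[Ff][Rr][Aa][Mm][Ee])\s[^>]*>")
TARGET = re.compile(r"(?:href|action|src)=([^\s>]+)", re.IGNORECASE)


def derive(symbol, grammar):
    """Enumerate all strings derivable from `symbol` (non-recursive grammars only)."""
    if symbol not in grammar:
        return [symbol]  # terminal string
    results = []
    for production in grammar[symbol]:
        parts = [derive(s, grammar) for s in production]
        results.extend("".join(combo) for combo in product(*parts))
    return results


def extract_links(grammar, start="S0"):
    """Collect link targets from every matching tag in every derived output."""
    links = set()
    for output in derive(start, grammar):
        for tag in TAG.finditer(output):
            target = TARGET.search(tag.group(0))
            if target:
                links.add(target.group(1).strip("\"'"))
    return links


if __name__ == "__main__":
    print(sorted(extract_links(GRAMMAR)))
```

Enumeration is only workable for small, finite grammars such as this one; it is shown here because it makes the expected result easy to check.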


5.3 Vulnerability Detector


When the construction of per-role sitemaps is complete, 
our analyzer compares the two reachable node sets to infer 
privileged nodes. As HTML outputs presented to differ- 
ent roles are usually different, the set of privileged nodes 
is not empty in most cases. After obtaining the set of 
privileged nodes, our analyzer uses the same context-free 
grammar constructor again to approximate the outcomes 
of forced browsing. Finally, it compares derived redirection
sets and the sizes of CFGs to determine whether
forced browsing attempts are successful.

Even when forced browsing is successful, it is possible 
that the corresponding page does not contain any sensi- 
tive information or operations and is therefore considered 
safe. We observed that some pages used as file inclusions 
only contain function and class definitions. Such pages 
normally serve as inclusion files and are safe on their own. 
When the automatic vulnerability detection is over, we 
identify such safe pages with manual analysis, report them 
as false positives, and then mark the remaining pages as 
potentially vulnerable pages. 
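
The bookkeeping in this step can be summarized with a short sketch. It assumes the per-role sitemaps are available as sets of page names and that the outcome of each forced-browsing attempt has already been decided; how that decision is made (comparing redirection sets and CFG sizes) is deliberately left out. Function names and page names are hypothetical.

```python
def privileged_nodes(reachable_a, reachable_b):
    """Pages explicitly reachable for role a but not for role b."""
    return set(reachable_a) - set(reachable_b)


def triage(privileged, forced_browsing_ok, manually_safe):
    """Split privileged pages into guarded, false-positive and vulnerable sets.

    forced_browsing_ok maps a page to True when a forced-browsing attempt
    by role b is judged successful (an input to this sketch, not computed here).
    """
    guarded = {p for p in privileged if not forced_browsing_ok.get(p, False)}
    reported = set(privileged) - guarded
    false_positives = reported & set(manually_safe)
    vulnerable = reported - false_positives
    return guarded, false_positives, vulnerable


if __name__ == "__main__":
    # Hypothetical sitemaps for an administrator (a) and a normal user (b).
    n_a = {"index.php", "view.php", "admin/options.php",
           "admin/users.php", "admin/form.php"}
    n_b = {"index.php", "view.php"}
    priv = privileged_nodes(n_a, n_b)
    outcome = {"admin/options.php": True, "admin/form.php": True}
    print(triage(priv, outcome, manually_safe={"admin/form.php"}))
```

Pages missing from forced_browsing_ok are treated as guarded here, which is a simplifying choice of this sketch rather than a statement about the tool.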


6 Empirical Evaluation 


To evaluate the effectiveness and performance of our ap- 
proach, we tested our tool on seven real-world PHP appli- 
cations, two of which have patched versions. We picked 
these applications because they have reported vulnerabil- 
ities, which include injection vulnerabilities as well as 
Subject          Files      Lines of code
                            PHP      HTML
SCARF               25    1,318         0
Events Lister       37    2,076       544
PHP Calendars       67    1,350         0
PHPoll              93    2,571         0
PHP iCalendar      183    8,276         0
AWCM               668   12,942     5,106
YaPiG              134    4,801     1,271


Table 1: Statistics on Evaluation Subjects. 


access control vulnerabilities. The test subjects include 
both traditional web applications and Web 2.0 applica- 
tions which use AJAX for client-server communications. 
The source code of all these PHP applications is publicly 
available. For each of the test subjects, we provide a spec- 
ification file of at most ten lines. We ran all the tests on a 
PC with a quad-core CPU (2.40GHz) and 4 GB of RAM. 

Our tool supports multiple roles and each role should 
have a set of distinctive application states. Typically, the 
administrator role has the most privileges; the normal user 
role has necessary privileges for common user operations; 
and the public user role has the least privileges. Although 
our tool can detect access control violations for any two 
roles, we chose to detect access control violations between 
administrators and normal users for two reasons. First, the 
operations and information that administrators can access 
are of greater importance than those that normal users can 
access. Second, it is often difficult for attackers to legally 
obtain administrator accounts, but easy to obtain normal 
user accounts. 

Table 1 shows the total number of files as well as the 
lines of code for each web application. For the two web 
applications that have patched versions, we only list the 
statistics for the patched versions in the table. The lines 
of code in each application are counted for both PHP 
and HTML, excluding comments and empty lines. Our 
analysis translates HTML code into equivalent PHP echo 
statements. 


6.1 Analysis Results 


Table 2 shows the analysis results for the nine web appli- 
cations. Note that we include two versions of SCARF and 
AWCM for vulnerability analysis. Columns “Vulnerable” 
and “FP” denote the numbers of detected true vulnerabili- 
ties and manually confirmed false positives respectively. 
Column “Guarded” shows the number of privileged pages 
that are protected by access checks. The last four columns 
show numbers of explicitly reachable nodes and explicit 
edges in per-role sitemaps. 

In summary, our tool found eight different access con- 
trol vulnerabilities, four of which are previously unknown. 
Project              Privileged  Vulnerable  FP  Guarded      Role a          Role b
                                                            Node    Edge    Node    Edge
SCARF                         4           1   0        3      19     149      15      69
SCARF (patched)               4           0   0        4      19     149      15      69
Events Lister v2.03           9           2   2        5      23     113      14      26
PHP Calendars                 3           1   0        2      19      35      19      30
PHPoll v0.97 beta             3           3   0        0      21      63      19      58
PHP iCalendar v1.1            1           0   0        1      51     292      50     292
AWCM v2.1                    47           1   0       46     176   2,634     129   2,438
AWCM v2.2 final              47           0   0       47     180   2,851     133   2,612
YaPiG 0.95                   11           0   0       11      54     260      44     154


Table 2: Vulnerability Analysis Results. 


It only has two false positives and correctly reports 119 
guarded pages as not vulnerable. We manually confirmed 
all vulnerabilities and false positives on deployed web 
applications. In addition, the by-products of our analysis, 
the generated per-role sitemaps, provide high-level views 
of the test subjects and can be useful for understanding or 
modifying the structures of these web applications. 


6.1.1 SCARF 


SCARF is the Stanford Conference And Research Forum.
A critical access control check tests whether the value
of $_SESSION[“privilege”] equals “admin” in functions
is_admin and require_admin.

Our tool detected a previously reported vulnerability 
(CVE-2006-5909). In this application, only users of role a 
are supposed to edit the configuration of the application in 
page “generaloptions.php”. However, there is no access 
check for this edit privilege. Although the link is hid- 
den from users of role b, they could still access and edit 
the configuration which affects the whole system. Our 
tool correctly reported the other three privileged pages 
“addsession.php”, “editpaper.php” and “editsession.php” 
as guarded. Even if users of role b know the locations 
of these pages, forced browsing would fail because of 
the presence of access checks in these pages. The lat- 
est version of SCARF fixed the vulnerability, and this is 
reflected in the vulnerability analysis result for SCARF 
(patched). 


6.1.2 Events Lister


Events Lister is a PHP application that allows users 
to manage their events. Function checkUser im- 
plements an access control by checking whether 
$_SESSION[“validUser”] equals true.

Our tool found a new vulnerability in this application 
as well as a previously known one (CVE-2009-3168). We
discovered that page “admin/setup.php” has no access 
checks and allows users of role b to repeatedly insert test 
events into the database of the application. It is even possible
to create new tables in the database if none exists yet.
The known vulnerability in page “admin/user_add.php” 
permits users of role b to add new users into the system. 
This privilege should only belong to users of role a. We 
consider the other two reports on privileged pages “admin/recover.php”
and “admin/form.php” false positives.
Page “admin/recover.php” allows users of role b to re- 
set an administrator’s password by sending a new pass- 
word to the administrator’s email address. Since only the 
administrator has access to her own email address, the 
password reset action does not pose any serious threats. 
Page “admin/form.php” contains an HTML form which 
is included in other container pages. On its own, this page 
does not expose any privileged operations or information, 
and is therefore considered safe. The notion of “safe” is 
sometimes a subjective matter. In a manual case study of 
another web application, we found that public users can 
view the list of all registered users with forced browsing. 
Such a list is also available for normal users and one can 
easily register for a normal user account. Consequently, 
it is unclear to us if the implicit access to the list of regis- 
tered users is intended. As such, we would rather report 
such cases to developers for them to decide. 


6.1.3 PHP Calendars


PHP Calendars is an online calendar management system. 
It protects privileged pages in the application by checking
whether $_SESSION[“admin”] equals “yes” in page
“admin/access.php”.

Our tool detected a known vulnerability (CVE-2010- 
0380) in page “install.php” of this application. The 
README file in this application warns administrators to 
delete this page after installation, but does not check if 
the file has indeed been deleted. If “install.php” exists in 
a deployed application, any users of role b could modify 
the configuration of the application by directly accessing 
this page. Because there is an explicit link to this page, 
we manually added this page to the privileged node set in 
the specification file. The other two privileged pages “admin/import.php”
and “powerfeed.php” are not vulnerable.
Note that N_a is not necessarily a superset of N_b. In this
application, |N_a| = |N_b|, but N_a ≠ N_b.


6.1.4 PHPoll 


PHPoll is an online poll system where only users 
of role a can pass access checks by providing 
correct values of $_COOKIE[$string_cook_login] and
$_COOKIE[$string_cook_password]. Note that the
cookie-based access controls are safe in this case because 
unauthorized users have no knowledge of valid cookie 
values. 

Our tool detected three new access control vulnera- 
bilities in this application and we manually confirmed 
them on a deployed application of PHPoll. All three 
pages have no access checks. The first page “modifica_configurazione.php”
allows users of role b to modify
login IDs and passwords, truncate the configuration table, 
and insert new entries into the configuration table of the 
application. The second page “modifica_votanti.php” lets 
users of role b delete votes or update polls stored in the 
MySQL database. The third page “modifica_band.php” 
does not prevent users of role b from reading, updating, 
or deleting poll results from the database with POST re- 
quests. These access control vulnerabilities pose serious 
threats to the security of the application, yet they have not 
been reported to the best of our knowledge. 


6.1.5 PHP iCalendar 


PHP iCalendar is another calendar application which 
displays calendar information to users. The only 
privileged page is “admin.php”, and it is guarded
by an access check which examines the value of
$HTTP_SESSION_VARS[“phpical_loggedin”].

This application does not have any access control vul- 
nerabilities. As Table 2 shows, users of role a can reach 
51 pages which include “admin.php”, while users of role 
b can only reach 50 pages which exclude “admin.php”. 


6.1.6 AWCM 


AWCM (AR Web Content Manage system) differ- 
entiates role a from role b by determining whether 
$_SESSION[“awcm_cp”] equals “yes” in a PHP include
file “control/common.php”. 

Our tool detected a previously known vulnerabil- 
ity (CVE-2010-1066) in “control/db_backup.php” which 
dumps all the database information onto a web page. The 
cause of this access control vulnerability is that “control/db_backup.php”
includes “common.php” instead of
“control/common.php”. Since access checks are only 
present in “control/common.php” but not “common.php”, 
page “control/db_backup.php” is not guarded and can be 
accessed via forced browsing. Most pages in the “control” 
directory are intended for administrators only and our tool 
detected 47 privileged nodes in total. Our tool correctly 
recognized the access checks in the other 46 privileged
pages and only reported “control/db_backup.php” to be 
vulnerable. The latest version of AWCM fixed the vulner- 
ability, and this is reflected in the analysis result shown 
in Table 2. Although this application is AJAX-heavy, our 
tool covered nearly 80% of the active nodes, indicating 
that a majority of the links appear in PHP and HTML 
code which can be well handled with our tool. 


6.1.7 YaPiG 


YaPiG (Yet Another PHP Image Gallery) validates pass- 
words and determines the privilege level of users with an 
access check in function check_admin_login. 

An interesting thing about YaPiG is that all the five 
unreachable pages result from an uncovered execution 
path. In our implementation, we assume that an HTTP 
parameter $v could have any value. Therefore, our tool
infers that function call isset($v) returns true even if
$v is undefined. When a conditional depends on such a
function call, the false branch is left unexplored. Our im- 
plementation does not yet support the specification of an 
optional value, which can either be defined or undefined. 


6.2 Performance Evaluation 


In our evaluation, we collect links that point to files within 
an application, excluding those that point to CSS files 
which are of no interest to us. Currently, we treat PHP, 
HTML and XML files as active nodes and analyze them
to extract links. A page can contain links to both active 
nodes and inactive nodes. Although inactive nodes do not 
provide sensitive operations, they may contain sensitive 
information and therefore should also be checked. 

Table 3 shows the coverage and performance of our 
tool. Column “Entry” shows the number of specified en- 
try nodes for each application. Column “Active” lists the 
number of all active nodes. Column “Orphan” lists the 
number of specified orphan nodes which are non-entry 
active nodes with no incoming edges. Column “Cover- 
age” lists the coverage of our tool on active nodes in an 
application, excluding orphan nodes. We list the aver- 
age numbers of variables and grammar productions of 
all CFGs for each web application. Note that the num- 
bers are counted on CFGs that have been simplified with 
grammar-reachability analysis. The last column shows 
the total analysis time spent for each application in terms 
of seconds. 
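
As a small worked illustration of the coverage column, the sketch below assumes that coverage is the number of explicitly reachable active nodes divided by the number of active nodes that are not specified as orphans. The function name and the sample numbers are hypothetical and are not taken from Table 3.

```python
def coverage(n_active, n_orphan, n_reachable):
    """Fraction of non-orphan active nodes that are explicitly reachable."""
    considered = n_active - n_orphan
    if considered <= 0:
        raise ValueError("no active nodes left after excluding orphans")
    return n_reachable / considered


if __name__ == "__main__":
    # Hypothetical application: 40 active nodes, 4 orphans, 27 reached.
    print(f"{coverage(40, 4, 27):.2%}")
```

Specifying more orphan nodes shrinks the denominator, which is why an incomplete orphan set can make the reported coverage look lower than it really is.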

Active nodes may have outgoing edges and may not 
have any incoming edges. An active node with no incom- 
ing edges can be optionally specified as either an entry 
node or an orphan node. When it is specified as an entry 
node, it is analyzed in the sitemap construction process 
to find outgoing edges; when it is specified as an orphan 
node, which indicates that this node should be outside 
any sitemaps, it is excluded from the coverage calculation;
when it is unspecified, it may affect the coverage
result. Let Active, Orphan and Reachable denote the sets
of all active nodes, specified orphan nodes and explicitly
reachable nodes respectively. We calculate the coverage
as:

$$\mathit{Coverage} = \frac{|\mathit{Reachable}|}{|\mathit{Active}| - |\mathit{Orphan}|}$$

In our evaluation, we conservatively identify orphan
nodes with a simple manual analysis and the obtained
orphan sets may be incomplete, especially for large and
complex applications. Therefore, the real coverage of our
analysis might be better than that shown in the table
because uncovered nodes might indeed be unreachable.

Project               Entry  Active  Orphan  Variables  Productions  Coverage  Time (s)
SCARF                     1      19       0        158          719   100.00%      6.02
SCARF (patched)           1      19       0        159          719   100.00%      6.01
Events Lister v2.03       4      23       5        100        2,083   100.00%      3.84
PHP Calendars             3      15       0         48          255    80.00%      5.09
PHPoll v0.97 beta         5      21       6        115          224   100.00%      4.26
PHP iCalendar v1.1        2      52       2        811        4,774    90.38%    760.62
AWCM v2.1                17     208      22        410          422    79.33%     89.48
AWCM v2.2 final          16     209      14        451          484    79.90%    108.51
YaPiG 0.95                7      59       3        332          532    91.53%    208.38

Table 3: Coverage and Performance Results.

Our static analyzer achieved good coverage of active
nodes: 100% for four applications, about 90% for two,
and about 80% for the remaining three. The total analysis
time listed in Table 3 demonstrates that our approach
is scalable. For the smaller test applications SCARF,
Events Lister, PHP Calendars and PHPoll, our tool finished
within seven seconds; for the largest test application
AWCM, our tool took less than two minutes to analyze
the active nodes in the whole application. The analysis
time for PHP iCalendar is the longest because of the inlining of
dynamic PHP files and the complexity of PHP code. As
can be seen in Table 3, the number of grammar productions
for PHP iCalendar is also the largest. We show the
breakdown of analysis time in Table 4. Columns “Admin
Sitemap” and “Normal Sitemap” list the time spent on
constructing the sitemaps for roles a and b respectively.
Column “Forced Browsing” shows the time spent on detecting
access control vulnerabilities via forced browsing.
It is obvious from the data in the table that building
sitemaps consumes the majority of the analysis time.

                          Time (s)
Project          Admin     Normal    Forced
                 Sitemap   Sitemap   Browsing
SCARF              3.15      1.70      1.15
Events Lister      2.29      1.00      0.53
PHP Calendars      1.81      1.67      1.61
PHPoll             2.39      1.54      0.33
PHP iCalendar    371.28    370.85     18.46
AWCM              55.36     49.11      3.85
YaPiG             85.59     44.91     77.86

Table 4: Analysis Time.


6.3 Discussions


As we mentioned earlier, our prototype did not find all
kinds of links in web applications. The major reason is
that our prototype did not identify all the links generated 
by JavaScript code or HTML templates, or those con- 
structed with unresolvable string variables. Extracting 
links from JavaScript code is especially challenging be- 
cause of the dynamic features of the JavaScript language. 
Our prototype works better on traditional web applica- 
tions than AJAX-heavy ones. Incorporating JavaScript 
analysis could possibly improve the coverage. Further- 
more, our test applications may not be representative of 
general web applications. 

What a node represents determines the granularity of 
the analysis. Our prototype treats a web page as a node, 
but the general approach still applies when the granularity 
is refined to functionalities within a page. Performing 
the analysis at a refined granularity would be especially 
useful for complex web pages which contain multiple 
functionalities within a single page. The techniques pro- 
posed by Halfond et al. [12] could be used to identify 
important parameters in web applications to distinguish 
functionalities. Because a privilege is often granted with 
a set of atomic database operations, advancing the gran- 
ularity to the level of database operations might be too 
fine-grained. 

Our prototype does not handle all object-oriented fea- 
tures in PHP. This prevents us from parsing some PHP 
pages in large PHP applications. We leave it as future
work to enhance our static analyzer for additional object-oriented
features of the PHP language.

The current implementation of the string constraint 
solver is rudimentary. For either unsolvable constraints 
or non-determinism in a conditional, we conservatively 
explore both branches. This might lead to false negatives 
when infeasible paths for a less privileged role are ex- 
plored. For access checks that involve non-determinism, 
such as password-based authentication and CSRF pro- 
tection that uses random tokens, we rely on role-based 
specifications to determine which execution paths to ex- 
plore. Non-determinism affects path explorations but not 
link extractions. Furthermore, when Assumption 2 does 
not hold, we would also have false negatives introduced 
by explicit accesses to privileged nodes. 

Our tool generated false positives. Even when access 
checks are missing in hidden pages, these pages may not 
contain any sensitive information or operations and are 
therefore safe to access for any role in the application. We 
manually examined the analysis results and marked such 
safe pages as false positives. 


7 Related Work 


In this section, we discuss the most relevant work, includ- 
ing specification inference, workflow violation detection, 
privilege separation based on user roles, language-based 
approaches to secure web applications, and program anal- 
ysis for web security. 

The capability of automated tools in detecting vulnera- 
bilities or bugs can only be as good as the specifications 
given to them. Since manually writing specifications is 
tedious, time-consuming and error-prone, a wide range 
of techniques have been proposed to automatically infer 
specifications from the source code of programs. For in- 
trusion detection, Wagner and Dean [28] apply static anal- 
ysis to derive a model of normal application behavior as an 
oracle. Based on the observation that bugs are deviant be- 
havior [9], researchers have proposed probabilistic-based 
approaches [16, 26] to infer specifications from applications.
However, without taking roles into account in
web applications, it is difficult to infer privileged pages 
which are only intended for a group of users. 

Recently, workflow violations have attracted the in- 
terests of researchers. Nemesis [7] uses dynamic infor- 
mation flow tracking to detect authentication and access 
control vulnerabilities in web applications. It requires 
developers to specify access control lists for resources. 
Similarly, Hallé et al. [13] proposed a runtime enforce- 
ment mechanism to only allow navigations that conform 
to a state machine model specified by developers. Re- 
searchers have proposed various techniques to automat- 
ically infer correct workflows. Swaddler [6] first learns 
internal states of web applications, and then detects ab- 
normal state violations at critical points. Targeting the 
detection of Ajax intrusion attacks, Guha et al. [11] lever- 
age static analysis on client-side JavaScript code to infer 
expected server-side behavior. To detect multi-module
vulnerabilities, MiMoSA [1] takes into account the in- 
teractions of different web pages. However, it is not 
always easy to distinguish an intended path from an unin- 
tended one because of flexible navigation paths that web 
applications allow. Its follow-up work Waler [10] uses 
a combination of dynamic analysis and symbolic model 
checking to first infer invariants from dynamic program 
executions, and then report violations of the invariants as 
logic vulnerabilities. From a high-level view, the likely 
invariants that Waler generates with heuristics are subject 
to errors. Furthermore, the inferred invariants may not 
always hold due to the limited coverage of dynamic anal- 
ysis. Access control vulnerabilities can be considered a 
special case of workflow vulnerabilities where cross-role 
workflow assumptions are violated. Cross-role compar- 
isons allow us to precisely reason about privileged pages 
in most cases. 


To reduce least-privilege incompatibilities, researchers 
distinguish different user roles and separate privileges 
based on different roles. Aiming at identifying dependen- 
cies on admin privileges in traditional software appli- 
cations, Chen et al. [4] run applications without admin 
privileges and collect dynamic execution traces. We take 
a step further and use roles to represent sets of privileges 
in web applications. In our setting, roles form a lattice and 
its height is not limited. To reduce developers’ burden of
securing web applications, the CLAMP project [23] prevents
leakage of sensitive information by restricting the
flows of user data and isolating the authentication module 
of an application. While they also minimize developers’ 
effort, they secure web applications by modifying appli- 
cation code at critical points. Web application vulnerabil- 
ity scanners can also automatically detect access control 
vulnerabilities. However, they often build shallow and 
incomplete sitemaps, missing deep and invisible pages 
that are only accessible when valid form data are submit- 
ted. This undermines the capabilities of web scanners in 
both discovering privileged nodes as well as successfully 
performing forced browsing with valid form data. 


Previous work has proposed language-based ap- 
proaches to secure web applications in a principled way. 
SIF [5] accepts specifications either as program annota- 
tions at compile time, or as user requirements at run time 
to guarantee confidentiality and integrity with informa- 
tion flow analysis. Recently, Krishnamurthy et al. [17] 
presented an object-capability language for fine-grained 
privilege separation for web applications. Unfortunately, 
these language-based approaches do not apply to the
large set of legacy code that is not written in the newly 
designed languages. 


In the past few years, researchers have focused their 
attention on detecting injection vulnerabilities in web ap- 
plications with both static analysis [18, 19, 25, 27, 29, 30, 
31, 32] and dynamic analysis [2, 3, 22, 24]. Similar to our
static analyzer, Pixy [14] is also a static analyzer built to 
analyze PHP applications. It takes advantage of taint anal- 
ysis to detect injection vulnerabilities with specifications 
on taint sources and sinks. Its implementation hinders it 
from scaling to large applications as Pixy has no support 
for include resolution and object-oriented features. 


8 Conclusions 


Developers should enforce access controls throughout 
web applications for every privileged page. This paper 
proposes a novel approach to detect access control vul- 
nerabilities in web applications with minimal manual ef- 
fort. Based on the observation that sitemaps presented 
to different roles are not identical, our analysis first au- 
tomatically infers the set of privileged pages from the 
source code of a web application, and then detects access 
control vulnerabilities via forced browsing. We added 
support for role-based specification rules, and integrated 
constraint-solving capabilities with our static analyzer to 
systematically explore program paths. Our tool is able 
to achieve good coverage and scale to real-world applica- 
tions. The evaluation results demonstrate that it is capable 
of detecting both unknown and known access control vul- 
nerabilities in unmodified web applications with only a 
few lines of specifications. For future work, we plan to 
support additional language features of PHP, enhance the 
string constraint solver, and scale the analysis to larger 
web applications. 
ADsafety
Type-Based Verification of JavaScript Sandboxing 
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Abstract adnet.com 
Web sites routinely incorporate JavaScript programs oe 2 = 


from several sources into a single page. These sources 
must be protected from one another, which requires ro- 
bust sandboxing. The many entry-points of sandboxes 
and the subtleties of JavaScript demand robust verifica- 
tion of the actual sandbox source. We use a novel type 
system for JavaScript to encode and verify sandboxing 
properties. The resulting verifier is lightweight and effi- 
cient, and operates on actual source. We demonstrate the 
effectiveness of our technique by applying it to ADsafe, 
which revealed several bugs and other weaknesses. 


1 Introduction 


A mashup Web page displays content and executes 
JavaScript from various untrusted sources. Facebook ap- 
plications, gadgets on the iGoogle homepage, and vari- 
ous embedded maps are the most prominent examples. 
By now, mashups have become ubiquitous. Indeed, web 
pages that display advertisements from ad networks are 
also mashups, because they often employ JavaScript for 
animations and interactivity. A survey of popular pages 
shows that a large percentage of them include scripts 
from a diverse array of external sources [41]. Unfortu- 
nately, these third-party scripts run with the same privi- 
leges as trusted, first-party code served directly from the 
originating site. Hence, the trusted site is susceptible to 
attacks by maliciously crafted third-party software. 

This paper addresses language-based Web sandbox- 
ing systems, one of several mechanisms for securing 
mashups. Most sandboxing mechanisms have similar 
high-level goals and designs, which we outline in sec- 
tion 2. In section 3, we review the design and implemen- 
tation of sandboxes and demonstrate the need for tool- 
supported verification. Section 4 provides a detailed plan 
for the rest of the paper. 


Figure 1: Web sandboxing architecture 


Our work makes several contributions: 


1. A type system for general JavaScript programs, 
with support for patterns found in sandboxing libraries;¹ 

2. a formal definition of safety properties for Yahoo!’s 
ADsafe sandbox in terms of this type system; and, 

3. a type-based verification of the ADsafe framework, 
and descriptions of bugs and their fixes found while 
performing the verification. 


¹See cs.brown.edu/research/plt/dl/adsafety/v1 
for our implementation, proofs, and other details. 


2 Language-based Web Sandboxing 


The Web browser environment provides references to objects 
that implement network access, disk storage, geolocation, 
and other capabilities. Legitimate web applications 
use them for various reasons, but embedded widgets 
can exploit them because all JavaScript on a page 
runs in the same global environment. A Web sandbox 
thus attenuates or prevents access to these capabilities, 
allowing pages to safely embed untrusted widgets. 
ADsafe [9], Caja [33], FBJS [13], and BrowserShield [35] 
are language-based sandboxes that employ 
broadly similar security mechanisms, as defined by Maffeis, 
et al. [27]: 
• A Web sandbox includes a static code checker that 
filters out certain widgets that are almost certainly 
unsafe. This checker is run before the widget is delivered 
to the browser. 


• A Web sandbox provides runtime wrappers that attenuate 
access to the DOM and other capabilities. 
These wrappers are defined in a trusted runtime library 
that is linked with the untrusted widget. 


• Static checks are necessarily conservative and can 
reject benign programs. Web sandboxes thus specify 
how potentially-unsafe programs are rewritten 
to use dynamic safety checks. 


This architecture is illustrated in figure 1, where an untrusted 
widget from adnet.com is embedded in a page 
from paper.com. The untrusted widget is filtered by 
the static checker. If static checking passes, the widget is 
rewritten to invoke the runtime library. Both the runtime 
library and the checked, rewritten widget must be hosted 
on a site trusted by paper.com, and are assumed to be 
free of tampering. 


Reference Monitors A Web sandbox implements a 
reference monitor between the untrusted widget and the 
browser’s capabilities. Anderson’s seminal work on reference 
monitors identifies their certification demands [3, 
p. 10-11]: 


The proof of [a reference monitor’s] model se- 
curity requires a verification that the modeled 
reference validation mechanism is tamper re- 
sistant, is always invoked, and cannot be cir- 
cumvented. 


Therefore, a Web sandbox must come with a precisely 
stated notion of security, and a proof that its static checks 
and runtime library correctly maintain security. The end 
result should be a quantified claim of safety over all pos- 
sible widgets that execute against the runtime library. 


3 Code-Reviewing Web Sandboxes 


Imagine we are confronted with a Web sandbox and 
asked to ascertain its quality. One technique we might 
employ is a code-review. Therefore, we perform an 
imaginary review of a Web sandbox, focusing on the de- 
tails of ADsafe. Later, we will discuss how to (mostly) 
remove people from the loop. 

ADsafe, like all Web sandboxes, consists of two interdependent 
components: 

• a static verifier, called JSLint,² which filters out 
widgets not in a safe subset of JavaScript, and 


• a runtime library, adsafe.js, which implements 
DOM wrappers and other runtime checks. 


These conspire to make it safe to embed untrusted wid- 
gets, though “safe” is not precisely defined. We will re- 
turn to the definition of safety in section 4. 


Attenuated Capabilities Widgets should not be able 
to directly reference various capabilities in the browser 
environment. Direct DOM references are particularly 
dangerous because, from an arbitrary DOM reference, 
elt, a widget can simply traverse the object graph and 
obtain references to all capabilities: 


var myWindow = elt.ownerDocument.defaultView; 
myWindow.XMLHttpRequest; 
myWindow.localStorage; 
myWindow.geolocation; 


Widgets therefore manipulate wrapped DOM elements 
instead of direct references. DOM wrappers form the bulk 
of the runtime library and include many dynamic checks 
and patterns that need to be verified: 


• The runtime manipulates DOM references, but returns 
them to the widget in wrappers. We must verify 
that all returned values are in fact wrapped, and 
that the runtime cannot be tricked into returning a 
direct DOM reference. 


• The runtime calls DOM methods on behalf of the 
widget. Many methods, such as appendChild and 
removeChild, require direct DOM references as arguments. 
We must verify that the runtime cannot be 
tricked with a maliciously crafted object that mimics 
the DOM interface and steals references. 


• The runtime attaches DOM callbacks on behalf of 
the widget. These callbacks are invoked by the 
browser with event arguments that include direct 
DOM references. We must verify that the runtime 
appropriately wraps calls to untrusted callbacks in 
the widget. 


• The widget has access to a DOM subtree that it is 
allowed to manipulate. The runtime ensures that 
the widget only manipulates elements in this subtree. 
We must verify that various DOM traversal 
methods, such as document.getElementById and 
Element.getParent, do not allow the widget to obtain 
wrappers to elements outside its subtree. 


• The runtime wraps many DOM functions that 
are only conditionally safe. For example, 
document.createElement is usually safe, unless it 
is used to create a <script> tag, which can load arbitrary 
code. Similarly, the runtime may allow widgets 
to set CSS styles, but a CSS URL-value can also 
load external code. We must verify that the arguments 
supplied to these DOM functions are safe. 


²JSLint can perform other checks that are not related to ADsafe. In 
this paper, “JSLint” refers to JSLint with ADsafe checks enabled. 
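The last item in this list can be illustrated with a small sketch. It is not taken from adsafe.js: the names makeSafeCreate, fakeDocument, and bannedTags are hypothetical, the list of banned tags is an assumption made for illustration, and a stub object stands in for the browser's document so that the example runs outside a browser.

```javascript
// Hypothetical sketch of a "conditionally safe" DOM wrapper (not ADsafe's code).
// A stub stands in for the browser's document object.
var fakeDocument = {
  createElement: function (tag) {
    return { tagName: tag.toUpperCase() };
  }
};

// Tags assumed, for illustration only, to be able to load code.
var bannedTags = { script: true, iframe: true, object: true, embed: true };

function makeSafeCreate(doc) {
  return function (tagName) {
    // Test the runtime type first, so that no implicit toString call
    // on a widget-supplied object can happen later.
    if (typeof tagName !== "string") {
      throw new Error("tag name must be a primitive string");
    }
    if (bannedTags[tagName.toLowerCase()] === true) {
      throw new Error("banned tag: " + tagName);
    }
    // A real sandbox would wrap the element before returning it.
    return doc.createElement(tagName);
  };
}

var safeCreate = makeSafeCreate(fakeDocument);
```

The wrapper refuses anything that is not a primitive string before it looks at the value, which is the pattern the verification has to confirm at every call into the DOM.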


ADsafe’s DOM wrappers are called Bunches, which wrap 
collections of HTML elements. There are twenty Bunch-manipulating 
functions that are exposed to the widget, 
in addition to several private helper functions, that face 
all the issues enumerated above and need to be verified. 
These functions cannot be verified in isolation, because 
their correctness is dependent on assumptions about the 
kinds of values they receive from widgets. These as- 
sumptions are discharged by the static checks in JSLint 
and other runtime checks to avoid loopholes and com- 
plexities in JavaScript’s semantics. 


JavaScript Semantics A Web sandbox must contend 
with JavaScript features that hinder security: 
• Certain JavaScript features are unsafe to use in widgets. 
For example, a widget can use this to obtain 
window, so it is rejected by JSLint: 


f = function() { return this; }; 
var myWindow = f(); 


We must verify that the subset of JavaScript admit- 
ted by the static checker does not violate the as- 
sumptions of the runtime library. 

• Many JavaScript operators and functions include 
implicit type conversions and method calls that are 
difficult to reason about. For example, when an op- 
erator expects a string but is instead given an object, 
it does not signal an error. Instead, it calls the ob- 
ject’s toString method. It is easy to write a stateful 
toString method that returns different strings on 
different calls. Such an object can then circumvent 
dynamic safety checks that are not carefully writ- 
ten to avoid triggering implicit method calls. These 
implicit calls are avoided by carefully testing the 
runtime types of untrusted values, using the typeof 
operator. Such tests are pervasive in ADsafe. As a 
further precaution, ADsafe tries to ensure that widgets 
cannot define toString and valueOf fields in 
objects. 
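The stateful toString attack described above can be reproduced in a few lines. The sketch below is illustrative only: naiveGet, carefulGet, and makeShifty are hypothetical names, and the "secret" field stands in for any field a sandbox wants to protect.

```javascript
// Hypothetical check-then-use code with a naive blacklist test.
function naiveGet(obj, name) {
  if (name == "secret") {          // first implicit toString call
    throw new Error("blocked");
  }
  return obj[name];                // second implicit toString call
}

// The same check, guarded by a typeof test on the untrusted value.
function carefulGet(obj, name) {
  if (typeof name !== "string") {
    throw new Error("name must be a primitive string");
  }
  if (name === "secret") {
    throw new Error("blocked");
  }
  return obj[name];
}

// An object whose toString returns a different string on each call.
function makeShifty() {
  var calls = 0;
  return {
    toString: function () {
      calls += 1;
      return calls === 1 ? "harmless" : "secret";
    }
  };
}

var target = { harmless: 1, secret: 42 };
var leaked = naiveGet(target, makeShifty()); // passes the check, then reads "secret"
```

The naive version converts the object to a string twice and sees two different strings. The careful version never triggers a conversion, because it rejects every value that is not already a primitive string.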


JavaScript Encapsulation JavaScript objects have no 
notion of private fields. If object operations are not re- 
stricted, a widget could access built-in prototypes (via 
the __proto__ field) and modify the behavior of the container. 
Web sandboxes statically reject such expressions: 


obj.__proto__; 


There are various other dangerous fields that are also 
blacklisted and hence rejected by sandboxes. However, 
syntactic checks alone cannot determine whether computed 
field names are unsafe: 


obj["__pro" + "to__"]; 


Widgets are instead rewritten to use runtime checks that 
restrict access to these fields. Figure 2 shows the rewrites 
employed by various sandboxes. Some sandboxes insert 
these and other checks automatically, giving the illusion 
of programming in ordinary JavaScript; ADsafe is more 
spartan, requiring widget authors to insert the dynamic 
checks themselves; but the principle remains the same. 


ADsafe        ADSAFE.get(obj, name) 
dojox.secure  get(obj, name) 
Caja          $v.r($v.ro('obj'), $v.ro('name')) 
WebSandbox    c(d.obj, d.name) 
FBJS          a12345_obj[$FBJS.idx(name)] 


Figure 2: Similar Rewritings for obj[name] 

Web sandboxes also simulate private fields with this 
method by introducing fields and then preventing wid- 
gets from accessing them. For example, ADsafe stores 
direct DOM references in the ___nodes___ field of Bunches, 
and blacklists the ___nodes___ field. 
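A runtime check of this kind might look like the following sketch. It is written in the spirit of the rewrites in figure 2 but is not the implementation of any of those sandboxes: safeGet and bannedNames are hypothetical names, and the list of banned names is an assumption made for illustration.

```javascript
// Hypothetical runtime check for computed field names (not ADsafe's API).
var bannedNames = {
  arguments: true, callee: true, caller: true, constructor: true,
  eval: true, prototype: true, unwatch: true, watch: true
};

function safeGet(obj, name) {
  // Accept only primitive numbers and strings, so that turning the name
  // into a property key cannot run widget-supplied code.
  if (typeof name === "number") {
    return obj[name];
  }
  if (typeof name !== "string") {
    throw new Error("field name must be a string or a number");
  }
  // Own-property test: inherited keys of bannedNames must not count.
  if (Object.prototype.hasOwnProperty.call(bannedNames, name)) {
    throw new Error("banned field: " + name);
  }
  // Names that start and end with an underscore are reserved as private.
  if (name.charAt(0) === "_" && name.charAt(name.length - 1) === "_") {
    throw new Error("private field: " + name);
  }
  return obj[name];
}

// A stand-in for a Bunch that keeps a direct reference in a private field.
var bunchLike = { ___nodes___: ["direct DOM reference"], label: "ok" };
```

Because the check runs on the computed string, it catches obj["__pro" + "to__"] even though no syntactic check could, and the same rule hides the private field from the widget.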


The Reviewability of Web Sandboxes 


We have highlighted a plethora of issues that a Web 
sandbox must address, with examples from ADsafe. Al- 
though ADsafe’s source follows JavaScript “best prac- 
tices,” the sheer number of checks and abstractions make 
it difficult to review. There are approximately 50 calls to 
three kinds of runtime assertions, 40 type-tests, 5 regular- 
expression based checks, and 60 DOM method calls in 
the 1,800 LOC adsafe.js library. Various ADsafe bugs 
were found in the past and this paper presents a few more 
(section 9). Note that ADsafe is a small Web sandbox 
relative to larger systems like Caja. 

The Caja project asked an external review team to per- 
form a code review [4]. The findings describe many low- 
level details that are similar to those we discussed above. 
In addition, two higher-level concerns stand out: 

• “[Caja is] hard to review. No map states invariants 
and points to where they are enforced, which hurts 
maintainability and security.” 

• “Documentation of TCB is necessary for reviewability 
and confidence.” 

These remarks identify an overarching requirement for 
any review: the need for specifications so that readers can 
both determine whether these fit their needs and check 
whether these are implemented correctly. 


4 Verifying a Sandbox: Our Roadmap 


Defining Safety Because humans are expensive and 
error-prone, and because the code review needs to be re- 
peated every time the program changes, it is best to au- 
tomate the review process. However, before we begin 
automating anything, we need some definition of what 
security means. We focus on a definition that is spe- 
cific to ADsafe, though the properties are similar to the 
goals of other web sandboxes. From correspondence 
with ADsafe’s author, we initially obtained the follow- 
ing list of intended properties (rewritten slightly to use 
the terminology of this paper). 


Definition 1 (ADsafety) If the containing page does not 
augment built-in prototypes, and all embedded widgets 
pass JSLint, then: 
1. widgets cannot load new code at runtime, or cause 
ADsafe to load new code on their behalf; 


2. widgets cannot affect the DOM outside of their des- 
ignated subtree; 


3. widgets cannot obtain direct references to DOM 
nodes; and 


4. multiple widgets on the same page cannot commu- 
nicate. 


Note that the first two properties are common to sand- 
boxes in general—allowing arbitrary JavaScript to load 
at runtime compromises all sandboxes’ security goals, 
and all sandboxes provide mediated access to the DOM 
by preventing direct access. 

We also note that the assumption about built-in pro- 
totypes is often violated in practice [14]. Nevertheless, 
like ADsafe, we make this assumption; mitigating it is 
outside our scope. Given this definition, our goal is to 
produce a (mostly) automated verification that supports 
these properties. 


Verifying Safety In this paper we perform this automa- 
tion using static types, presenting a type-based approach 
for defining and verifying the invariants of ADsafe. 
While one could build a custom tool to do this, we are 
able to perform our verification by extending (as dis- 
cussed in section 11) a type checker [18] intended for 
traditional type-checking of JavaScript. 

We choose a static type system as our tool 
for several reasons. First, programmers are familiar with type 
systems, and ours is mostly standard (we discuss nonstandard 
features in sections 5 and 7). This lessens the 
burden on sandbox developers who need to understand 
what the verification is saying about their code. Sec- 
ond, our type system is much more efficient than most 
whole-program analyses or model checkers, leading to a 
quick procedure for checking ADsafe’s runtime library 
(20 seconds). Efficiency and understandability allow for 
incremental use in a tight development loop. Finally, our 
type system is accompanied by a soundness proof. This 
property accomplishes the actual verification. Thus, the 
features of comprehensibility, efficiency, and soundness 
combine to make type checking an effective tool for ver- 
ifying some of the properties of web sandboxes. 

In order to demonstrate the effectiveness of our type- 
based verification approach, we use type-based argu- 
ments to prove ADsafety. We mostly achieve this (sec- 
tion 8) after fixing bugs exposed by our type checker 
(section 9). The rest of this paper presents a typed ac- 
count of untrusted widgets and the ADsafe runtime. 


• The ADsafety claim is predicated on widgets passing 
the JSLint checker. Therefore, we need to model 
JSLint’s restrictions. We do this in section 5. 


• Once we know what we can expect from JSLint, 
we can verify the actual reference monitor code in 
adsafe.js using type-checking (section 7). 


• Before we can verify adsafe.js, we need to account 
for the details of JavaScript source and model 
the browser environment in which this code runs. 
Section 6 presents this additional work. 


We discuss extensions to verify other Web sandboxes in 
section 10. 


5 Modeling Secure Sublanguages 


All web sandboxes’ runtime libraries expect to exe- 
cute against widgets that have been statically checked 
and rewritten, as shown in figure 1. These checks and 
rewrites enforce that widgets are written in a sublan- 
guage of JavaScript. This sublanguage ought to be spec- 
ified explicitly. We focus here on modeling the checks 
performed by JSLint, ADsafe’s static checker, which 
presents an interesting challenge: there is no formal 
specification of the language of JavaScript programs that 
pass JSLint. Instead, the specification is implicit in the 
implementation of JSLint itself. In this section, we design 
a specification for JSLint-ed widgets and give confidence 
in its correctness.³ 

Only a fraction of JSLint’s static checks are related 
to ADsafe. The rest are lint-like code-quality checks. 
JSLint also checks the static HTML of a widget. Verifying 
this static HTML is beyond the scope of our work; we do 
not discuss it further. We instead focus on the security- 
critical static JavaScript checks in JSLint. 


³Because we want a strategy that extends to other sandboxes, we 
do not try to exploit the fact that JSLint is written in JavaScript. The 
Cajoler of Caja is instead written in Java, and the filters and rewriters 
for other sandboxes might be written in other languages. The strategy 
we outline here avoids both getting bogged down in the details of all 
these languages and over-reliance on JavaScript itself. 


\[
\begin{array}{rcl}
\alpha & := & \text{type identifiers} \\
T & := & \mathsf{Num} \mid \mathsf{Str} \mid \mathsf{True} \mid \mathsf{False} \mid \mathsf{Undef} \mid \mathsf{Null} \\
  & \mid & \mathsf{Ref}\ T \mid \forall\alpha.T \mid \mu\alpha.T \\
  & \mid & [T]\,T \times \cdots \times T \times T\cdots \to T \\
  & \mid & \top \mid \bot \mid T \cup T \mid T \cap T \mid \mathsf{Array}\langle T\rangle \\
  & \mid & \{\star : F,\ \mathsf{proto} : T,\ \mathsf{code} : T,\ f : F,\ \ldots\} \\
  & \mid & \langle f, \ldots\rangle^{+} \mid \langle f, \ldots\rangle^{-} \\
F & := & T \mid \& \mid \mathsf{Absent}
\end{array}
\]


Figure 3: Type Language for ADsafe and Widgets 


How is JSLint used? The ADsafe runtime makes sev- 
eral assumptions about the shape of values it receives 
from widgets. These assumptions are not documented 
precisely, but they correspond to various static checks 
in JSLint. To model JSLint, we reflect these checks in 
a type, called Widget, which we define below. In sec- 
tion 5.2 we discuss how this type relates to the behavior 
of the JSLint implementation. 


5.1 Defining Widget 


We expect that all variables and sub-expressions of wid- 
gets are typable as Widget. The ADsafe runtime can thus 
assume that widgets only manipulate Widget-typed val- 
ues. Our full type language is shown in figure 3 and in- 
troduced gradually in the rest of this section. 


Primitives JSLint admits JavaScript’s primitive val- 
ues, with trivial types: 


\[
\mathsf{Prim} = \mathsf{Num} \cup \mathsf{Str} \cup \mathsf{True} \cup \mathsf{False} \cup \mathsf{Null} \cup \mathsf{Undef}
\]


We have separate types for True and False because they 
are necessary to type-check adsafe.js (section 7). Prim 
is an untagged union type, and our type system ac- 
counts for common JavaScript patterns for discriminat- 
ing unions. We might initially assume that 


Widget = Prim 


Objects and Blacklisted Fields JSLint admits object 
literals but blacklists certain field names as dangerous. 
All other fields are allowed to contain widget values. We 
therefore augment the Widget type to include objects. An 
object type explicitly lists the names and types of various 
fields in an object. In addition, the special field $\star$ 
specifies the type of all other fields: 


\[
\mathsf{Widget} = \mu\alpha.\ \mathsf{Prim} \cup \mathsf{Ref}
\left\{\begin{array}{l}
\star : \alpha, \\
\texttt{"arguments"} : \&, \\
\texttt{"caller"} : \&, \\
\texttt{"callee"} : \&, \\
\texttt{"eval"} : \&, \\
\vdots \\
\texttt{"toString"} : \mathsf{Absent}, \\
\texttt{"valueOf"} : \mathsf{Absent}
\end{array}\right\}
\]


The full list of blacklisted fields is in figure 4. Our type 
checker signals a type error on any &-typed field ac- 
cess or assignment. This mirrors the behavior of JSLint, 
which also rejects field accesses and assignments on 
blacklisted fields (e.g., o["constructor"] is rejected by 
both the type checker and JSLint). 

The Ref tag indicates that the object is mutable. We 
use a recursive type ($\mu$) to indicate that all other fields, 
$\star$, may recursively contain Widget-typed values.⁴ JSLint 
tries to ensure that objects in widgets do not have 
toString and valueOf properties. We model this with 
a type Absent, which ensures these fields are not present. 

Absent and & properties are subtly different. & mod- 
els fields that are intended to be inaccessible, and hence 
looking them up is untypable. In contrast, the typing rule 
for Absent field lookup performs the lookup with the type 
of the proto field, which we introduce below. Section 7.1 
contains the details of type-checking field access. 
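The difference between the two markers can be sketched as a lookup over a toy representation of object types. This is a simplified model and not the typing rule of section 7.1: the names lookupType, BANNED, and ABSENT are hypothetical, types are represented as plain strings, and returning "Undef" when the prototype chain runs out is an assumption of the sketch.

```javascript
// Toy model of typing a field lookup o[name] for a known string name.
// BANNED plays the role of the "&" marker; ABSENT the role of Absent.
var BANNED = { kind: "banned" };
var ABSENT = { kind: "absent" };

// An object type is { fields: {...}, star: ..., proto: objectType or null },
// where star gives the type of every field not listed in fields.
function lookupType(objType, name) {
  var t = Object.prototype.hasOwnProperty.call(objType.fields, name)
    ? objType.fields[name]
    : objType.star;
  if (t === BANNED) {
    // Banned fields are meant to be inaccessible: the lookup is untypable.
    throw new TypeError("field '" + name + "' is banned");
  }
  if (t === ABSENT) {
    // The field is known not to be on this object: continue with the
    // type of the proto field (assumption: Undef if there is none).
    return objType.proto === null ? "Undef" : lookupType(objType.proto, name);
  }
  return t;
}

var objectProto = { fields: { toString: "Function" }, star: ABSENT, proto: null };
var widgetObject = {
  fields: { eval: BANNED, toString: ABSENT },
  star: "Widget",
  proto: objectProto
};
```

Under these assumptions, looking up "toString" on widgetObject falls through to the prototype and yields "Function", an unlisted field yields "Widget", and "eval" is a type error.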


Functions Widgets can create and apply functions, so 
we must widen our Widget type to admit them. Func- 
tions in JavaScript are objects with an internal code field, 
which we add to allowed objects: 


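\[
\cdots\ \mathsf{Ref}
\left\{\begin{array}{l}
\mathsf{code} : [\mathsf{Global} \cup \alpha]\,\alpha\cdots \to \alpha, \\
\star : \alpha, \\
\ldots
\end{array}\right\}
\]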


The type of the code field indicates that widget-functions 
may have an arbitrary number of Widget-typed arguments 
and return Widget-typed results.⁵ It also specifies 
that the type of the implicit this-argument (written 
inside brackets) may be either Widget or Global. The 
type Global is not a subtype of Widget, which expresses 
the underlying reason for JSLint’s rejection of all widgets 
that contain this (see Claim 1 below). If the this-annotation 
is omitted, the type of this is $\top$. 


⁴$\mu\alpha.T$ binds the type variable $\alpha$ in the type $T$ to the whole type, 
$\mu\alpha.T$. Therefore, $\alpha$ is in fact the type Widget. 

⁵The $\alpha\cdots$ syntax is a literal part of the type, and means the function 
can be applied to any number of additional $\alpha$-typed arguments. 


Prototypes JSLint does not allow widgets to explicitly 
manipulate objects’ prototypes. However, since field 
lookup in JavaScript implicitly accesses the prototypes, 
we specify the type of prototypes in Widget: 


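\[
\cdots\ \mathsf{Ref}
\left\{\begin{array}{l}
\mathsf{proto} : \mathsf{Object} \cup \mathsf{Function} \cup \cdots, \\
\star : \alpha, \\
\ldots
\end{array}\right\}
\]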


The proto field enumerates several safe prototypes, but 
notably omits DOM prototypes such as HTMLElement, 
since widgets should not obtain direct references to the 
DOM. 


Typing Private Fields In addition to explicitly black- 
listed field names, JSLint also blacklists all field names 
that start and end with an underscore. This effectively 
blacklists the __proto__ field, which gives direct access 
to the prototype-chain, and the ___nodes___ and ___star___ 
fields, which adsafe.js uses internally to build the 
Bunch abstraction. To keep our types simple, we enu- 
merate these three fields instead of pattern-matching on 
field names: 


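\[
\cdots\ \mathsf{Ref}
\left\{\begin{array}{l}
\texttt{"\_\_\_nodes\_\_\_"} : \mathsf{Array}\langle\mathsf{HTML}\rangle \cup \mathsf{Undef}, \\
\texttt{"\_\_proto\_\_"} : \&, \\
\texttt{"\_\_\_star\_\_\_"} : \mathsf{Bool} \cup \mathsf{Undef}, \\
\star : \alpha, \\
\ldots
\end{array}\right\}
\]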


The __proto__ field is &-typed, like other blacklisted 
fields that are never used. However, the ADsafe runtime 
uses ___nodes___ and ___star___ as private fields. The 
types specify that ADsafe stores DOM references in the 
___nodes___ field. 

The full Widget type in figure 4 is a formal specifica- 
tion of the shape of values that adsafe.js receives from 
and sends to widgets. This type is central to our verifica- 
tion of adsafe.js and of JSLint. 


5.2 Widget and JSLint Correspondence 


Though we have offered intuitive arguments for why 
Widget corresponds to the checks in JSLint, we would 
like to gain confidence in its correspondence with the be- 
havior of the actual JSLint program that sites use: 


Claim 1 (Linted Widgets Are Typable) If JSLint (with 
ADsafe checks) accepts a widget e, then e and all of its 
variables and sub-expressions can be Widget-typed. 


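We validate this claim by testing. We use ADsafe’s sample 
widgets as positive tests (widgets that should be typable 
and lintable) and our own suite of negative test 
cases (widgets that should be untypable and unlintable). 
Note the direction of the implication: an unlintable widget 
may still be typable, since our type checker admits 
safe widgets that JSLint rejects.⁶ The type checker could 
be used as a replacement for JSLint’s ADsafe checks, but 
these tests give us confidence that checking the Widget 
type corresponds to what JSLint admits in practice. 


\[
\mathsf{Widget} = \mu\alpha.\ \mathsf{Str} \cup \mathsf{Num} \cup \mathsf{Null} \cup \mathsf{Bool} \cup \mathsf{Undef} \cup \mathsf{Ref}
\left\{\begin{array}{l}
\mathsf{proto} : \mathsf{Object} \cup \mathsf{Function} \cup \mathsf{Bunch} \cup \mathsf{Array} \cup \mathsf{RegExp} \\
\qquad\quad {} \cup \mathsf{String} \cup \mathsf{Number} \cup \mathsf{Boolean}, \\
\star : \alpha, \\
\mathsf{code} : [\mathsf{Global} \cup \alpha]\,\alpha\cdots \to \alpha, \\
\texttt{"\_\_\_nodes\_\_\_"} : \mathsf{Array}\langle\mathsf{HTML}\rangle \cup \mathsf{Undef}, \\
\texttt{"\_\_\_star\_\_\_"} : \mathsf{Bool} \cup \mathsf{Undef}, \\
\texttt{"caller"} : \&,\quad \texttt{"callee"} : \&, \\
\texttt{"eval"} : \&,\quad \texttt{"prototype"} : \&, \\
\texttt{"watch"} : \&,\quad \texttt{"constructor"} : \&, \\
\texttt{"\_\_proto\_\_"} : \&,\quad \texttt{"unwatch"} : \&, \\
\texttt{"arguments"} : \&,\quad \texttt{"valueOf"} : \mathsf{Absent}, \\
\texttt{"toString"} : \mathsf{Absent}
\end{array}\right\}
\]


Figure 4: The Widget type 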


6 Modeling JavaScript and the Browser 


Verification of a Web sandbox must account for the id- 
iosyncrasies of JavaScript. It also needs to model the run- 
time environment—provided by the browser—in which 
the sandboxed code will execute. Here we discuss how 
we model the language and the browser. 


JavaScript Semantics We use the semantics of Guha, 
et al. [17], which reduces JavaScript to a core semantics 
called $\lambda_{JS}$. This latter language models the “essentials” 
of JavaScript: prototype-based objects, first-class functions, 
basic control operators, and mutation. 

$\lambda_{JS}$ thus omits many of JavaScript’s complexities, but 
it is accompanied by a desugaring function that maps all 
JavaScript programs (idiosyncrasies included) to behaviorally 
equivalent $\lambda_{JS}$ programs. The transformation explicates 
much of JavaScript’s implicit semantics. Hence, 
we find it easier to build tools that analyze the much 
smaller $\lambda_{JS}$ language than to directly process JavaScript. 

Does desugaring faithfully map JavaScript to $\lambda_{JS}$? 
Guha, et al. test their desugaring and semantics on portions 
of the Mozilla JavaScript test suite. On these 
tests, $\lambda_{JS}$ programs produce exactly the same output as 
JavaScript implementations. Hence, their work substantiates 
the following two claims. 


⁶The supplemental material contains examples of the differences. 


{ 
  eval: &, 
  setTimeout: (Widget → Widget) × Widget → Int, 
  document: { 
    write: &, 
    writeln: &, 
    ... 
  }, 
  ... 
} 

Figure 5: A Fragment of the Type of window 


Claim 2 (Desugaring is Total) For all JavaScript programs 
$e$, $\mathit{desugar}[\![e]\!]$ is defined. 


Claim 3 (Desugar Commutes with Eval) For all 
JavaScript programs $e$, 
$\mathit{desugar}[\![\mathit{eval}_{\mathrm{JavaScript}}(e)]\!] = \mathit{eval}_{\lambda_{JS}}(\mathit{desugar}[\![e]\!])$. 


This testing strategy, and the simplicity of implementation 
that $\lambda_{JS}$ enables, give us confidence that our tools 
correctly account for JavaScript. 


Modeling the Browser DOM ADsafety claims that 
window.eval is not applied. To validate this claim, 
we mark eval with & from section 5, which marks 
banned fields. There are many eval-like functions 
in Web browsers, such as document.write; these are 
also marked &. Finally, certain functions, such as 
setTimeout, behave like eval when given strings as ar- 
guments. ADsafe does need to call these functions, but 
it is careful to never call them with strings. In our type 
environment, we give them restrictive types that disallow 
string arguments. 
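The discipline of never passing strings to setTimeout can be expressed as a small guard. The sketch below is an illustration, not code from adsafe.js: the wrapper name setTimeoutSafely is hypothetical, and it mirrors the restrictive type only loosely by accepting a function and a number and nothing else.

```javascript
// Hypothetical guard around setTimeout (not part of adsafe.js).
// In browsers, setTimeout evaluates a string argument as code, so only
// function callbacks are forwarded.
function setTimeoutSafely(callback, milliseconds) {
  if (typeof callback !== "function") {
    throw new Error("callback must be a function, not code in a string");
  }
  // Require a primitive number so that no implicit valueOf call can run.
  if (typeof milliseconds !== "number") {
    throw new Error("delay must be a number");
  }
  return setTimeout(callback, milliseconds);
}
```

A type checker that gives setTimeout a type with no string arguments rules out statically what this guard rejects at runtime.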

Figure 5 specifies a fragment of the type of window, 
which carefully specifies the type of unsafe functions in 
the environment. The remaining safe DOM does not need 
to be fully specified. adsafe.js only uses a small sub- 
set of the DOM methods. These methods require types. 
The browser environment is therefore modeled with 500 
lines of object types (one field per line). This type envi- 
ronment is essentially the specification of foreign DOM 
functions imported into JavaScript. 


7 Verifying the Reference Monitor 


In section 5, we discussed modeling the sublanguage of 
widgets interacting with the sandboxing runtime. In the 
case of ADsafe and JSLint, we built up the Widget type 
as a specification of the kinds of values that the reference 
monitor, adsafe.js, can expect at runtime. In this sec- 
tion, we discuss how we use the Widget type to model the 
boundary between reference monitor and widget code, 
and ensure that the runtime library correctly guards critical 
behavior. 


var dom = { 
  append: 
    function (bunch) 
    /*: [Widget ∪ Global] Widget × Widget ··· → Widget */ 
    { // body of append ... }, 
  combine: 
    function (array) 
    /*: [Widget ∪ Global] Widget × Widget ··· → Widget */ 
    { // body of combine ... }, 
  q: 
    function (text) 
    /*: [Widget ∪ Global] Widget × Widget ··· → Widget */ 
    { // body of q ... }, 
  // ... more dom ... 
}; 


Figure 6: Annotations on the dom object 

The Widget type specifies the shape of widget values 
that the ADsafe runtime manipulates. Widget is therefore 
used pervasively in our verification of adsafe.js. For 
example, consider a typical Bunch method: 
Bunch.prototype.append = function(child) { 
  reject_global(this); 
  var elts = child.___nodes___; 
  // ... 
  return this; 
} 
The Bunch objects that ADsafe passes to the widget have 
Bunch.prototype as their proto (see figure 4), making 
these methods accessible. Their use in the widget is con- 
strained only by JSLint, so we must type-check these 
methods with (only) JSLint’s assumptions in mind. 

For example, we might assume that the child argu- 
ment above should be a Bunch, the implicit this argu- 
ment should also be a Bunch, and it therefore returns a 
Bunch. However, JSLint does not provide such strong 
guarantees. Consider this example, which passes JSLint: 


var func = someBunch.append; 
func(900, true, "junk", -7); 


Here, this is bound to window, child is a number, and 
there are additional arguments. Therefore, we cannot assume 
that append has the type $[\mathsf{Bunch}]\,\mathsf{Bunch} \to \mathsf{Bunch}$. 
Instead, the most precise type we can ascribe is: 


\[
[\mathsf{Widget} \cup \mathsf{Global}]\,\mathsf{Widget}\cdots \to \mathsf{Widget}
\]


That is, this could be Widget-typed or the type of the 
global object, Global, and the other arguments may have 
any subtype of Widget, which includes strings, numbers, 
and other non-Bunch types. The runtime check 
in append’s body (namely, reject_global(this)) is responsible 
for checking that this is not the global object 
before manipulating it. Our type checker recognizes such 
checks and narrows the broader type to Widget after appropriate 
runtime checks are applied (section 7.1). If 
such checks were missing, the type of this would remain 
$\mathsf{Widget} \cup \mathsf{Global}$, and return this would signal a 
type error because $\mathsf{Widget} \cup \mathsf{Global}$ is not a subtype of the 
stated return type Widget. 
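The runtime behavior that this narrowing relies on can be sketched as follows. The definitions are stand-ins and not the code of adsafe.js: this reject_global and this Bunch constructor are assumptions written for the example, and the check also refuses undefined so that the sketch behaves the same in strict mode.

```javascript
// Hypothetical stand-ins for ADsafe's reject_global and Bunch.
var theGlobal = globalThis;

function reject_global(that) {
  // A method called without a receiver sees the global object as `this`
  // in sloppy mode and undefined in strict mode; both are refused.
  if (that === undefined || that === null || that === theGlobal) {
    throw new Error("method called without a valid receiver");
  }
}

function Bunch(nodes) {
  this.___nodes___ = nodes;
}

Bunch.prototype.append = function (child) {
  reject_global(this); // past this line, `this` cannot be the global object
  return this;         // so returning it does not leak the global object
};

var someBunch = new Bunch([]);
var func = someBunch.append; // detach the method from its receiver
```

Calling func(900, true, "junk", -7) now fails inside reject_global instead of returning the global object, which is the runtime counterpart of narrowing the type of this from Widget ∪ Global to Widget.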

Ascribing types to functions provided by the ADsafe 
runtime is therefore trivial. We give them all the same type: 


\[
[\mathsf{Widget} \cup \mathsf{Global}]\,\mathsf{Widget}\cdots \to \mathsf{Widget}
\]


The type checker we extend is not ADsafe-specific, and 
requires explicit type annotations. However, since all the 
annotations are identical, they are trivial to insert. Fig- 
ure 6 shows a small excerpt of such annotations, which 
the checker reads from comments, so programs can run 
unaltered in the browser. 


Types for Private Functions ADsafe also has a num- 
ber of private functions, which are not exposed to the 
widget. These functions have types with capabilities the 
widget does not have access to, such as HTML. For ex- 
ample, ADsafe specifies a hunter object, which con- 
tains functions that traverse the DOM and accumulate ar- 
rays of DOM nodes. These functions all have the type 
HTML → Undef, and add to an array result that has
type Array(HTML). ADsafe can freely use these capa- 
bilities inside the library as long as it doesn’t hand them 
over to the widget. Our annotations show that it doesn’t, 
because these types are not compatible with Widget. 


7.1 Type System Highlights 


In sections 5 and 6, we presented types for safe objects
and for values in the browser environment. We build 
upon earlier work on type systems that has been ap- 
plied to JavaScript [18]. In this section, we present the 
non-standard portions of our type system that we use for 
typing operations on objects, sensitive conditionals, and 
some idiosyncrasies of JSLint and adsafe.js. 


Object Properties and String Set Types In 
JavaScript, object properties (or “fields”) are merely 
string indices: even o.x is just an alias for o["x"]. 
In addition, these strings can be computed and flow 
through the program before they are used to look 
up fields. Sandboxes thus deal with whitelists and 
blacklists of property names. To model this, we enrich 
the type language with sets of strings. For example,
("__nodes__", "__proto__")⁻ is the type of all
strings except "__nodes__" and "__proto__", and
("x", "foo")⁺ is the type of exactly "x" and "foo".
Figure 7 shows typing rules and operations for string
sets. Sets support combination via unions, subtyping via
adding new strings, and subtyping of positive and negative
sets. Both kinds of string sets can also be promoted
to the common supertype of Str, which is equivalent to
the negative string set with no entries.


[Figure 7 rules: ST-STRINGSET+, ST-STRINGSET−, ST-STRING±, EQUIV-STR.
Legible fragments: Σ; Γ ⊢ str : (str)⁺ and Str <: ()⁻. The remaining
premises and conclusions are not recoverable from the extracted text.]


Figure 7: Typing and operations on string set types
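
The operations described above can be sketched in JavaScript as set inclusion over positive and negative string sets. This is an illustrative model only, not the checker's implementation: the encoding and the names pos, neg, Str, isSubtype, and union are assumptions made for this sketch.

```javascript
// Hypothetical encoding: a string set is { positive, strs }.
// A positive set lists the strings it contains; a negative set lists
// the strings it excludes.
function pos(...strs) { return { positive: true, strs: new Set(strs) }; }
function neg(...strs) { return { positive: false, strs: new Set(strs) }; }

// Str is modeled as the negative set with no entries.
var Str = neg();

function isSubset(a, b) {
    for (const s of a) { if (!b.has(s)) return false; }
    return true;
}

function isDisjoint(a, b) {
    for (const s of a) { if (b.has(s)) return false; }
    return true;
}

// Is every string described by `sub` also described by `sup`?
function isSubtype(sub, sup) {
    if (sub.positive && sup.positive) return isSubset(sub.strs, sup.strs);
    if (!sub.positive && !sup.positive) return isSubset(sup.strs, sub.strs);
    if (sub.positive && !sup.positive) return isDisjoint(sub.strs, sup.strs);
    return false; // an infinite negative set never fits in a finite positive set
}

// Union of two string sets, expressed again as a string set.
function union(a, b) {
    if (a.positive && b.positive) return pos(...a.strs, ...b.strs);
    if (!a.positive && !b.positive) {
        return neg(...[...a.strs].filter((s) => b.strs.has(s)));
    }
    const p = a.positive ? a : b;
    const n = a.positive ? b : a;
    return neg(...[...n.strs].filter((s) => !p.strs.has(s)));
}
```

Under this encoding, isSubtype(pos("x"), Str) holds, while isSubtype(pos("eval"), neg("eval")) does not.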

Equipped with string sets, we can describe the typing 
of object property dereference. When the property name 
is a string set, we union the types of the properties that 
are members of the string set, paying careful attention to 
absent fields and prototype lookup. Figure 8 shows the 
rule T-LOOKUP, with examples shown in figure 9. 

String sets allow the type checker to avoid certain 
named properties, as in the last example of figure 9, 
where the "eval" property has the bad type & but the 
string set type of the index excludes "eval". The rule 
for property update (not shown here) is similar but sim- 
pler, as property update in JavaScript does not recur in- 
side prototypes, and only operates on the property names 
of the top-level object. 
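
As a rough illustration of lookup with string sets, the following sketch computes the union of field types selected by a string set. It is a simplified model, not the T-LOOKUP rule: it assumes every object type has proto : Null, represents types as plain strings, and uses the hypothetical name "Bad" for the type of dangerous values such as eval.

```javascript
// Hypothetical encoding of string sets (positive or negative).
function pos(...strs) { return { positive: true, strs: new Set(strs) }; }
function neg(...strs) { return { positive: false, strs: new Set(strs) }; }

function contains(set, s) {
    return set.positive ? set.strs.has(s) : !set.strs.has(s);
}

// objType: { star: <type of unnamed fields>, named: Map<name, type> }.
// Prototype chains are not modeled; proto is assumed to be Null.
function fields(objType, set) {
    const result = new Set();
    for (const [name, type] of objType.named) {
        if (contains(set, name)) result.add(type);
    }
    // Can the index be a string that is not one of the named fields?
    const mayBeUnnamed = set.positive
        ? [...set.strs].some((s) => !objType.named.has(s))
        : true; // a negative set describes infinitely many strings
    if (mayBeUnnamed) {
        result.add(objType.star);
        result.add("Undef"); // the field may be absent altogether
    }
    if (result.has("Bad")) return "untypable";
    return [...result].sort().join(" ∪ ");
}

var small = { star: "Bool", named: new Map([["x", "Num"]]) };
var guarded = {
    star: "Str",
    named: new Map([["x", "Num"], ["y", "Bool"], ["eval", "Bad"]])
};
```

With these definitions, indexing guarded by neg("eval") yields a union of ordinary types, whereas indexing it by pos("eval") is rejected.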


If-Splitting A reference monitor has various runtime 
checks to ensure that protected objects—DOM objects 
and browser functions in ADsafe’s case—are only ma- 
nipulated in safe and well-defined ways. For example, 
when setTimeout’s first argument is a string, rather than 
a function, it exhibits eval-like behavior, which violates 
ADsafety’s constraints. Thus we instead give it the type 


(Widget → Widget) × Widget → Num


Doing so forces the first argument to be a function and, 
in particular, not a string. Now consider its use: 

later: function (func, timeout)
  /*: Widget × Widget → Widget */ {
    if (typeof func === "function") {
        setTimeout(func, timeout || 0);
    } else { error(); }
}


Because ADSAFE.later is exported to widgets, it can
only assume the Widget type for its arguments, including
func. A traditional type checker would thus conclude
that func has type Widget everywhere in later. Because
Widget includes Str, the invocation of setTimeout would
yield a type error, even though this is precisely what the
conditional in later is avoiding!


[Figure 8: the T-LOOKUP rule is not recoverable from the extracted text.
The only legible remnant is a shorthand for object types with a default
field, proto, code, and named fields f_i : F_i.]

Figure 8: Typing object lookup


Object Type T_o                                              String Type S    fields(T_o, S)
{proto : Null, * : Bool, "x" : Num}                          ("x")⁺           Num
{proto : Null, * : Bool, "x" : Num}                          ("x", "y")⁺      Num ∪ Bool ∪ Undef
[Two rows for object types with proto : Object are not recoverable from the extracted text.]
{proto : Null, * : Str, "x" : Num, "y" : Bool, "eval" : &}   [not legible]    untypable
{proto : Null, * : Str, "x" : Num, "y" : Bool, "eval" : &}   ("eval")⁻        Str ∪ Num ∪ Bool ∪ Undef


Figure 9: Examples of property lookup using fields

If-splitting is the name for a collection of techniques 
that address this problem [39]. Our particular solution 
uses a refinement of this idea, called flow typing [18], 
which complements type-checking with flow analysis. 
The analysis informs the type checker that due to the
typeof check, uses of func in the then-branch of the
conditional can in fact be refined from the large Widget
type of Str ∪ Num ∪ ... to the function type that
setTimeout requires.
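
The runtime behavior that this refinement relies on can be sketched as follows. The sketch is illustrative and not taken from adsafe.js: makeLater, schedule, and the error callback are hypothetical names, and schedule stands in for setTimeout so that the example runs outside a browser.

```javascript
// Builds a `later` function around an injected scheduler and error handler.
function makeLater(schedule, error) {
    return function later(func, timeout) {
        if (typeof func === "function") {
            // In this branch func is known to be a function, not a string,
            // so it is safe to hand to the scheduler.
            return schedule(func, timeout || 0);
        } else {
            return error();
        }
    };
}

// A recording scheduler used only for demonstration.
var scheduled = [];
var later = makeLater(
    function (func, timeout) {
        scheduled.push({ func: func, timeout: timeout });
        return scheduled.length;
    },
    function () { throw new Error("ADsafe violation"); }
);
```

Calling later with a function records it for scheduling; calling it with a string such as "alert(1)" reaches the error branch and never reaches the scheduler.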


7.2 Required Refactorings 


Our type system cannot type check the ADsafe runtime 
as-is; we need to make some simple refactorings. The
need for these refactorings does not reflect a weakness
in ADsafe. Rather, they address programming patterns that
we cannot verify with our type system. To gain confi- 
dence that we didn’t change ADsafe’s behavior, we run 
ADsafe’s sample widgets against our refactored version 
of ADsafe, and they behave as expected. We describe 
these refactorings below: 


Additional reject_name Checks ADsafe uses
reject_name to check accesses and updates to object
properties in adsafe.js. If-splitting uses these checks
to narrow string set types and type-check object property
references. However, ADsafe does not use reject_name
in every case. For example, it uses a regular expression
to parse DOM queries, and uses the result to look up
object properties. Because our type system makes
conservative assumptions about regular expressions, it
would erroneously indicate that a blacklisted field may
be accessed. Thus, we add calls to reject_name so the
type system can prove that the accesses and assignments
are safe.


Inlined reject_global Checks Most Bunch methods
start by asserting reject_global(this), which ensures
that this is Widget-typed in the rest of the method.
Our type system cannot account for such non-local side-
effects, but once we inline reject_global, if-splitting
is able to refine types appropriately (for instance, in the
Bunch.prototype.append example earlier in this section).
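
The shape of this refactoring can be sketched as follows. This is a simplified illustration rather than the actual ADsafe source: the body of reject_global shown here, the globalObject binding, and the two append variants are assumptions made for the sketch.

```javascript
var globalObject = globalThis;

function error() {
    throw new Error("ADsafe violation");
}

// Before: the check lives in a helper, so its effect on `this` is not
// visible in the method body without reasoning across the call.
function reject_global(that) {
    if (that === globalObject) {
        error();
    }
}

function appendWithHelper(child) {
    reject_global(this);
    return this;
}

// After: the same test is inlined, so a flow-sensitive checker sees the
// conditional directly and can narrow the type of `this` below it.
function appendInlined(child) {
    if (this === globalObject) {
        error();
    }
    return this;
}
```

Both variants behave identically at runtime; only the inlined form exposes the conditional to if-splitting.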


makeableTagName ADsafe’s whitelist of safe DOM ele- 
ments is defined as a dictionary: 


var makeableTagName =
    { "div": true, "p": true, "b": true, ... };


This dictionary omits an entry for "script". The
document.createElement DOM method creates new
nodes. We ensure that <script> tags are not created by
typing it as follows:


document.createElement : ("script")⁻ → HTML


ADsafe uses its tag whitelist before calling
document.createElement:

if (makeableTagName[tagName] === true) {
    document.createElement(tagName);
}


Our type checker cannot account for this check. We in- 
stead refactor the whitelist (a trick noted elsewhere [29]): 


var makeableTagName =
    { "div": "div", "p": "p", "b": "b", ... };


The types of these strings are ("div")⁺, ("p")⁺, ("b")⁺,
etc., so that makeableTagName[tagName] has type
("div", "p", "b", ...)⁺. Since this finite set of strings
excludes "script", it now matches the argument type
of createElement.
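
A small sketch may help show why the refactored whitelist is easier to reason about: the value passed on to element creation is read out of the whitelist rather than supplied by the caller. The names makeTag and createElement are hypothetical, createElement is a stand-in for document.createElement so the sketch runs without a DOM, and the own-property test is an assumption of this sketch rather than part of ADsafe.

```javascript
// Whitelist mapping each safe tag name to itself (three placeholder tags).
var makeableTagName = { "div": "div", "p": "p", "b": "b" };

// Stand-in for document.createElement.
function createElement(tag) {
    return { tagName: tag };
}

function error() {
    throw new Error("ADsafe violation");
}

function makeTag(tagName) {
    // Assumption of this sketch: only own properties count, so inherited
    // names such as "toString" are not treated as whitelisted tags.
    if (!Object.prototype.hasOwnProperty.call(makeableTagName, tagName)) {
        return error();
    }
    // The argument comes from the whitelist, so it is one of a finite
    // set of known strings that does not include "script".
    return createElement(makeableTagName[tagName]);
}
```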


7.3 Cheating and Unverifiable Code


A complex body of code like the ADsafe runtime cannot 
be type-checked from scratch in one sitting. We there- 
fore found it convenient to augment the type system with 
a cheat construct that ascribes a given type to an ex- 
pression without descending into it. We could thus use 
cheat when we encountered an uninteresting type error 
and wanted to make progress. Our goal, of course, was 
to ultimately remove every cheat from the program. 

We were unable to remove two cheats, leaving eleven 
unverified source lines in the 1,800 LOC ADsafe run- 
time. We can, in fact, ascribe interesting types to these 
functions, but checking them is beyond the power of our 
type system. The details may not be of interest to the 
general reader, but the web content contains the full body 
of unverified code and a discussion of its types.


8 ADsafety Redux


Sections 5 and 7 gave the details of our strategy for mod- 
eling JSLint and verifying adsafe.js. In this section, 
we combine these results and relate them to the original
definition of ADsafety (definition 1). The use of a type
system allows us to make straightforward, type-based
arguments of safety for the components of ADsafe.

The lemmas below formally reason about type-checked
widgets. Claim 1 (section 5.2) establishes that
linted widgets are in fact typable. Therefore, we do not 
need to type-check widgets. Widget programmers can 
continue to use JSLint and do not need to know about 
our type checker. However, given the benefits of unifor- 
mity provided by a type checker over ad hoc methods 
like JSLint (section 9 details one exploit that resulted 
from such an ad hoc approach), programmers may be 
well served to use our type checker instead. 


Type Soundness Most type systems come with a 
soundness theorem that is stated as progress (well-typed 
programs do not error) and preservation (well-typed pro- 
grams do not violate their types). 

We do not attempt to establish progress. Establishing 
it would require many more refactorings in the ADsafe 
runtime, and many lintable widgets would be untypable. 
Because runtime errors are perfectly acceptable (they
halt execution before something bad happens), we relax
some of the typing rules in an existing type system [18],
which does exhibit progress, to instead allow
some JavaScript errors (e.g., applying non-function values
or looking up fields of null). We do still need an
“untyped progress” theorem that states that our JavaScript
semantics fully models all error cases. This theorem is
provided by Guha et al. [17].

We restate and prove preservation for the extensions 
to Guha et al.’s type system, which is applicable to all 
JavaScript programs.⁷ Stated formally:


Lemma 1 (Type Preservation) If, for an expression e,
type T, environment Γ and abstract heap Σ,

1. Σ ⊢ σ,

2. Σ; Γ ⊢ e : T, and

3. σe → σ′e′,


then there exists a Σ′ with Σ′ ⊢ σ′ and Σ′; Γ ⊢ e′ : T.


Our assumed environment (section 6) provides the abstract
heap Σ and abstract environment Γ, which model
the initial state of the browser, σ. Given this lemma, we
can make type-based statements about the combination 
of widgets and adsafe.js: 


⁷ For the formal proof, see Guha et al. [18] and the supplemental
material on the web.


Theorem 1 (ADsafety) For all widgets p, if
1. all subexpressions of p are Widget-typable,
2. adsafe.js is typable,
3. adsafe.js runs before p, and
4. σp → σ′p′ (single-step reduction),
then at every step p′, p′ also has the type Widget.


This theorem says that for all widgets p whose subexpressions
are Widget-typed, if adsafe.js type-checks
and runs in the browser environment, p can take any 
number of steps and still have the Widget type. Since 
types are preserved, two further key lemmas hold during 
execution: 


Lemma 2 (Widgets cannot load new code at runtime) 
For all widgets e, if all variables and sub-expressions of 
e are Widget-typed, then e does not load new code. 


By section 6, eval-like functions are &-typed, hence 
cannot be referenced by widgets or by the ADsafe run- 
time. Furthermore, functions that only eval when given 
strings, such as setTimeout, have restricted types that 
disallow string-typed arguments. Therefore, neither the 
widget nor the ADsafe runtime can load new code. ∎


Lemma 3 (Widgets do not obtain DOM references) 
For all widgets e, if all variables and sub-expressions of 
e are Widget-typed, then e does not obtain direct DOM 
references. 


The type of DOM objects is not subsumed by the Widget 
type. All functions in the ADsafe runtime have the type: 


[Widget ∪ Global]Widget ··· → Widget


Thus, functions in the ADsafe runtime do not leak DOM 
references, as long as they are only applied to Widget- 
typed values. Since all subexpressions of the widget e 
are Widget-typed, all values that e passes to the ADsafe 
runtime are Widget-typed. By the same argument, e cannot
directly manipulate DOM references either. ∎


Widgets can only manipulate their DOM subtree 
We cannot prove this claim with our tools. JSLint 
enforces this property by also verifying the static 
HTML of widgets; it ensures that all element IDs 
are prefixed with the widget’s ID. The wrapper for 
document.getElementById ensures that the widget ID
is a prefix of the element ID. Verifying JSLint’s HTML 
checks is beyond the scope of this work. 
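
The prefix check performed by such a wrapper might look like the following sketch. It is not the ADsafe wrapper itself: makeGetById and lookup are hypothetical names, and lookup stands in for document.getElementById so that the example runs without a DOM.

```javascript
// Builds a lookup function that only resolves IDs carrying the widget's prefix.
function makeGetById(widgetId, lookup) {
    return function (elementId) {
        if (typeof elementId !== "string" ||
                elementId.slice(0, widgetId.length) !== widgetId) {
            throw new Error("ADsafe violation");
        }
        return lookup(elementId);
    };
}

// A toy page with one element owned by the widget and one owned by the host.
var page = { "AD_BUTTON": "widget button", "HOST_LOGIN": "host form" };
var getById = makeGetById("AD_", function (id) { return page[id]; });
```

The check is only as strong as the guarantee that the static HTML of the widget uses prefixed IDs, which is the part enforced by JSLint.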

In addition, the wrapper for Element.parentNode
checks to see if the current element is the root of the
widget’s DOM subtree. It is not clear if our type checker can
express this property without further extensions.


ADSAFE.go("AD_", function (dom, lib) {
    var myWindow, fakeNode, fakeBunch, realBunch;

    fakeNode = {
        appendChild: function(elt) {
            myWindow = elt.ownerDocument.defaultView;
        },
        tagName: "div",
        value: null
    };

    fakeBunch = { "__nodes__": [fakeNode] };

    realBunch = dom.tag("p");
    fakeBunch.value = realBunch.value;
    fakeBunch.value(""); // calls phony appendChild

    myWindow.alert("hacked");
});


Figure 10: Exploiting JSLint 


Widgets cannot communicate This claim is false; 
section 9 presents a counterexample. 


9 Bugs Found in ADsafe 


We have implemented the type system presented in this 
paper, and applied it to the ADsafe source. The imple- 
mentation is about 3,000 LOC, and takes 20 seconds to 
check adsafe.js (mainly due to the presence of recur- 
sive types). In some cases, type-checking failed due to 
the weakness of the type checker; these issues are dis- 
cussed in section 7.2. The other failures, however, rep- 
resent genuine errors in ADsafe that were present in the 
production system. The same applies to instances where 
JSLint and our typed model of it failed to conform. All 
the errors listed below have been reported, acknowledged 
by the author, and fixed. 


Missing Static Checks JSLint inadvertently allowed 
widgets to include underscores in quoted field names. In 
particular, the following expression was deemed safe: 


fakeBunch = { "__nodes__": [ fakeNode ] };


A malicious widget could then create an object with an 
appendChild method, and trick the ADsafe runtime into
invoking it with a direct reference to an HTML element, 
which is enough to obtain window and violate ADsafety: 


fakeNode = {
    appendChild: function(elt) {
        myWindow = elt.ownerDocument.defaultView;
    }
};


The full exploit is in figure 10.


ADSAFE.go("AD_", function (dom, lib) {
    var called = false;
    var obj = {
        "toString": function() {
            if (called) {
                return "url(evil.xml#exp)";
            }
            else {
                called = true;
                return "dummy";
            }
        }
    };
    dom.append(dom.tag("div"));
    dom.q("div").style("MozBinding", obj);
});


<!-- evil.xml -->

<?xml version="1.0"?>
<bindings><binding id="exp">
<implementation><constructor>
document.write("hacked")
</constructor></implementation>
</binding></bindings>


Figure 11: Firefox-specific Exploit for ADsafe 


This bug manifested as a discrepancy between our 
model of JSLint as a type checker and the real JSLint. 
Recall from section 5 that all expressions in widgets 
must have type Widget (defined in figure 4). For
{ "__nodes__": [fakeNode] } to type as Widget, the
"__nodes__" field must have type Array(HTML) ∪ Undef.
However, [fakeNode] has type Widget, which signals the 
error. 
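
The missing static check can be approximated as a test on property names that applies equally to identifiers and to quoted strings. The rule used below (reject names that begin or end with an underscore) is an assumption made for illustration; the exact rule applied by JSLint may differ, and isBannedName and bannedKeys are hypothetical names.

```javascript
// Hypothetical name check: names with a leading or trailing underscore
// are treated as reserved for the runtime.
function isBannedName(name) {
    return name.charAt(0) === "_" || name.charAt(name.length - 1) === "_";
}

// Returns the own keys of an object literal that a linter should reject,
// regardless of whether they were written quoted or unquoted.
function bannedKeys(objectLiteral) {
    return Object.keys(objectLiteral).filter(isBannedName);
}
```

Applied to an object such as { "__nodes__": [], tagName: "div" }, the sketch flags only "__nodes__".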

JSLint similarly allowed "__proto__" and other fields
to appear in widgets. We did not investigate whether they 
can be exploited as above, but setting them causes unan- 
ticipated behavior. Fixing JSLint was simple once our 
type checker found the error. (An alternative solution 
would be to use our type system as a replacement for 
JSLint.) We note that when the ADsafe option of JSLint 
was first announced,⁸ its author offered:


If [a malicious client] produces no errors when 
linted with the ADsafe option, then I will buy 
you a plate of shrimp. 


After this error report, he confirmed, “I do believe that I 
owe you a plate of shrimp”. 


⁸ tech.groups.yahoo.com/group/caplet/message/


Missing Runtime Checks Many functions
in adsafe.js incorrectly assumed that they
were applied to primitive strings. For example,
Bunch.prototype.style began with the following
check, to ensure that widgets do not programmatically
load external resources via CSS:


Bunch.prototype.style = function(name, value) {
    if (/url/i.test(value)) { // regex match?
        error();
    }
};

Thus, the following widget code would signal an error: 


someBunch.style("background",
                "url(http://evil.com/image.jpg)");


The bug is that if value is an object instead of a 
string, the regular-expression test method will invoke 
value.toString(). 

A malicious widget can construct an object with a 
stateful toString method that passes the test when first 
applied, and subsequently returns a malicious URL. In 
Firefox, we can use such an object to load an XBL
resource⁹ that contains arbitrary JavaScript (figure 11).

We ascribe types to JavaScript’s built-ins to prevent 
implicit type conversions. Therefore, we require the ar- 
gument of Regexp.test to have type Str. However, since 
Bunch.prototype.style can be invoked by widgets, its
type is Widget × Widget → Widget, and thus the type of
value is Widget. 

This bug was fixed by adding a new string_check
function to ADsafe, which is now called in 18 functions.
All these functions are not otherwise exploitable, but a 
missing check would cause unexpected behavior. The 
fixed code is typable. 
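
The difference between the unchecked and the checked version can be reproduced outside ADsafe with a few lines of JavaScript. The sketch below is illustrative only: styleUnchecked, styleChecked, and makeStatefulValue are hypothetical names, and String(value) stands in for the later use of the value in a CSS assignment.

```javascript
// A value whose string form changes between the check and the use.
function makeStatefulValue() {
    var called = false;
    return {
        toString: function () {
            if (called) {
                return "url(evil.xml#exp)";
            }
            called = true;
            return "dummy";
        }
    };
}

function error() {
    throw new Error("ADsafe violation");
}

// Unchecked: the regular-expression test converts value to a string once,
// and the later use converts it a second time.
function styleUnchecked(value) {
    if (/url/i.test(value)) {
        error();
    }
    return String(value);
}

// Checked: anything that is not a primitive string is rejected first,
// so no implicit conversion can occur.
function styleChecked(value) {
    if (typeof value !== "string") {
        error();
    }
    if (/url/i.test(value)) {
        error();
    }
    return value;
}
```

The stateful object passes the test in styleUnchecked and then yields the URL, while styleChecked rejects it before any conversion takes place.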


Counterexamples to Non-Interference Finally, a
type error in Bunch.prototype.getStyle helped us generate
a counterexample to ADsafe’s claim of widget non-interference
(definition 1, part 4). The getStyle method
is available to widgets, so its type must be Widget →
Widget. The following code is the essence of getStyle:


Bunch.prototype.getStyle = function (name) {
    var sty;
    reject_global(this);
    sty = window.getComputedStyle(this.__node__);
    return sty[name];
}


The bug above is that name is unchecked, so it may index
arbitrary fields, such as __proto__:


someBunch.getStyle("__proto__");


This gives the widget a reference to the prototype of the
browser’s CSSStyleDeclaration objects. Thus the return
type of the body is not Widget, yielding a type error.

A widget cannot exploit this bug in isolation. However,
it can replace built-in methods of CSS style objects
and interfere with the operation of the hosting page and
other widgets that manipulate styles in JavaScript.

⁹ https://developer.mozilla.org/en/XBL
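
The effect of an unchecked property name can be demonstrated with ordinary objects. In the sketch below, computedStyle stands in for the object returned by window.getComputedStyle, sharedProto stands in for the shared prototype of style objects, and rejectNameSketch is a simplified, hypothetical version of a name check; none of these names come from adsafe.js.

```javascript
function error() {
    throw new Error("ADsafe violation");
}

// Simplified name check: reject non-strings and names with a leading
// or trailing underscore (an assumption made for this sketch).
function rejectNameSketch(name) {
    if (typeof name !== "string" ||
            name.charAt(0) === "_" ||
            name.charAt(name.length - 1) === "_") {
        error();
    }
}

function getStyleUnchecked(computedStyle, name) {
    return computedStyle[name];
}

function getStyleChecked(computedStyle, name) {
    rejectNameSketch(name);
    return computedStyle[name];
}

// A style-like object that inherits from a prototype shared by all
// style objects on the page.
var sharedProto = { getPropertyValue: function (n) { return this[n]; } };
var computedStyle = Object.create(sharedProto);
computedStyle.color = "red";
```

The unchecked version hands out sharedProto itself when asked for "__proto__", which is what lets one widget alter methods seen by the page and by other widgets.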

This bug was fixed by adding a reject_name check 
that is now used in this and other methods. Despite 
the fix, ADsafe still cannot enforce non-interference, 
since widgets can reference and affect properties of other 
shared built-ins: 


var arr = []; 
arr.concat.channel = "shared data"; 


The author of ADsafe pointed out the above example and 
retracted the claim of non-interference. 
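
The channel can be observed directly: because all arrays share the same concat function through Array.prototype, a property set on it by one piece of code is visible to any other. The functions widgetA and widgetB below are hypothetical stand-ins for two separately sandboxed widgets.

```javascript
// Writer: attaches data to a built-in function reachable from any array.
function widgetA() {
    var arr = [];
    arr.concat.channel = "shared data";
}

// Reader: obtains the same built-in function from an unrelated array.
function widgetB() {
    var other = [];
    return other.concat.channel;
}
```

Neither function holds a reference to the other's array, yet the value written by widgetA is readable in widgetB.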


Prior Exploits Before and during our implementation, 
other exploits were found in ADsafe and reported [27-29].
We have run our type checker on the exploitable
code, and our tools catch the bugs and report type errors. 


Fixing Bugs and Tolerating Changes Each of our bug
reports resulted in several changes to the source, which
we tracked. In addition to these changes, adsafe.js
also underwent non-security related refactorings during
the course of this work. Although we did not provide our
type checker to its author, we were easily able to continue
type-checking the code after these changes. One change involved
adding a number of new Bunch methods to extend
the API. Keeping up-to-date was a simple task, since all 
the new Bunch methods could be quickly annotated with 
the Widget type and checked. In short, our type checker 
has shown robustness in the face of program edits. 


10 Beyond ADsafe 


Our security type system is capable of verifying useful 
properties about JavaScript programs in general. Sec- 
tions 5, 6, and 7 present carefully crafted types that we 
ascribe to the browser API and adsafe.js, and use to
model widget programs. Proving these types hold over 
the ADsafe runtime library and JSLint-ed widgets guar- 
antees robust sandboxing properties for ADsafe. 
Verifications for other sandboxes would require the de- 
sign of new types, to accurately model checked, rewritten 
programs and their interface to the sandbox, but not nec- 
essarily a new type system. Indeed, our type-based strat- 
egy provides a concrete roadmap for sandbox designers: 
1. Formally specify the language of widgets using a 
type system; 
2. use this specification to define the interface between 
the sandbox and untrusted code; and, 
3. check that the body of the sandbox adheres to this 
interface by type-checking. 
In particular, developers of new sandboxes should be
aware of this strategy. Rather than trying to retrofit the
type system’s features onto existing static checks, the
sandbox designer can work with the type system to guarantee
safety constructively from the start. Tweaks and
extensions to the type system are certainly possible; for
example, one may want to design a sandboxing framework
that forbids applying non-function values and looking
up fields of null, which the current type system allows
(section 8).

ADsafe shares many programming patterns with other 
Web sandboxes (section 3), but doesn’t cover the full 
range of their features. We outline some of the exten- 
sions that could be used to verify them here: 


Reasoning About Strings Our type system lets pro- 
grammers reason about finite sets of strings and use these 
sets to look up fields in objects. To verify Caja, we would
need to reason about string patterns. For example, Caja
uses the field named "foo" + "_w__" to store a flag that
determines if the field "foo" is writable. 


Abstracting Runtime Tests Our type system accounts 
for inlined runtime checks, but requires some refactor- 
ings when these checks are abstracted into predicates. 
Larger sandboxes, like Caja, have more predicates, so 
refactoring them all would be infeasible. We could in- 
stead use ideas from occurrence typing [39], which ac- 
counts for user-defined predicates. 


Modeling the Browser Environment ADsafe wraps a 
small subset of the DOM API and we manually check that 
this subset is appropriately typed in the initial type envi- 
ronment. This approach does not scale to a sandbox that 
wraps more of the DOM. If the type environment were 
instead derived from the C++ DOM implementation, we 
would have significantly greater confidence in our envi- 
ronmental assumptions. 


11 Related Work 


Verifying JavaScript Web Sandboxes ADsafe [9], 
BrowserShield [35], Caja [33], and FBJS [13] are 
archetypal Web sandboxes that use static and dynamic 
checks to safely host untrusted widgets. However, the se- 
mantics of JavaScript and the browser environment con- 
spire to make JavaScript sandboxing difficult [17, 26]. 
Maffeis et al. [27] use their JavaScript semantics to 
develop a miniature sandboxing system and prove it cor- 
rect. Armed with the insight gained by their semantics 
and proofs, they find bugs in FBJS and ADsafe (which 
we also catch). However, they do not mechanically ver- 
ify the JavaScript code in these sandboxes. They also for- 
malize capability safety and prove that a Caja-like sub- 
set is capability safe [30]. However, they do not verify
the Caja runtime or the actual Caja subset. In contrast,
we verify the source code of the ADsafe runtime and ac- 
count for ADsafe’s static checks. 

Taly, et al. [38] develop a flow analysis to find bugs 
in the ADsafe runtime (that we also catch). They sim- 
plify the analysis by modeling ECMAScript 5 strict 
mode, which is not fully implemented in any current Web 
browser. In contrast, ADsafe is designed to run on cur- 
rent browsers, and thus supports older and more permis- 
sive versions of JavaScript. We use the semantics and 
tools of Guha, et al. [17], which are not limited to
strict mode, so we find new bugs in the ADsafe runtime.
In addition, Taly, et al. use a simplified model of
JSLint. In contrast, we provide a detailed, type-theoretic 
account of JSLint, and also test it. We can thus find se- 
curity bugs in JSLint as well. 

Lightweight Self-Protecting JavaScript [31,34] is a 
unique sandbox that does not transform or validate wid- 
gets. It instead solely uses reference monitors to wrap 
capabilities. These are modeled as security automata, 
but the model ignores the semantics of JavaScript. In 
contrast, this paper and the aforementioned works are 
founded on detailed JavaScript semantics. 

Yu, et al. [40] use JavaScript sandboxing techniques 
to enforce various security policies on untrusted code. 
Their semantic model, CoreScript, simplifies the DOM 
and scripting language. CoreScript cannot be used to 
mechanically verify the JavaScript implementation of a 
Web sandbox, which is what we present in this paper. 


Modeling the Web Browser There are formal mod- 
els of Web browsers that are tailored to model whole- 
browser security properties [1,6]. These do not model 
JavaScript’s semantics in any detail and are therefore or- 
thogonal to semantic models of JavaScript [17,26] that 
are used to reason about language-based Web sandboxes. 
In particular, ADsafe’s stated security goals are lim- 
ited to statements about JavaScript and the DOM (sec- 
tion 4). Therefore, we do not require a comprehensive 
Web-browser model. 


Static Analysis of JavaScript GateKeeper [15] uses a 
combination of program analysis and runtime checks to 
apply and verify security policies on JavaScript widgets. 
GateKeeper’s program analysis is designed to model 
more complex properties of untrusted code than we address
by modeling JSLint. However, the soundness of its
static analysis is proven relative to only a restricted
sublanguage of JavaScript, whereas λJS handles the full
language. In addition, they do not demonstrate the validity
of their run-time checks.

Chugh et al. [8] and VEX [5] use program analy- 
sis to detect possibly malicious information flows in 
JavaScript. Our type system cannot specify information
flows, although we do use it to discover that ADsafe fails
to enforce a desirable information flow property. VEX’s 
authors acknowledge that it is unsound, and Chugh et al. 
do not provide a proof of soundness for their flow analy- 
sis. Our type system and analysis are proven sound. 
Other static analyses for JavaScript [16, 21,22] are not 
specifically designed to encode and check security. 


Type Systems Our type checker is based on that of 
Guha, et al. [18]. Theirs has a restrictive type system for 
objects that we fully replace to type check ADsafe. We 
also add simple extensions to their flow typing system 
to account for additional kinds of runtime checks em- 
ployed by ADsafe. Their paper surveys other JavaScript 
type systems [2,19] that can type-check other patterns 
but have not been used to verify security-critical code, 
which is the goal of this paper. Our treatment of ob- 
jects is also derived from ML-ART [36], but accounts 
for JavaScript features and patterns such as function ob- 
jects, prototypes, and objects as dictionaries. 


Language-Based Security Schneider et al. [37] sur- 
vey the design and type-based verification of language- 
based security systems. JavaScript Web sandboxes are 
inlined reference monitors [12]. Guha, et al. [17] offer a 
type-based strategy to verify these, but their approach— 
which depends on building a custom type rule around 
each check in the reference monitor—does not scale to 
a program of the size of ADsafe. Furthermore, their 
custom rules essentially hand-code if-splitting, which we 
obtain directly from the underlying type system. 
Cappos, et al. [7] present a layered approach to build- 
ing language sandboxes that prevents bugs in higher lay- 
ers from breaking the abstractions and assurances pro- 
vided by lower layers. They use this approach to build a 
new sandbox for Python, whereas we verify an existing, 
third-party JavaScript sandbox. However, our verifica- 
tion techniques could easily be used from the onset to 
build a new sandbox that is secure by construction. 


IFrames IFrames are widely used for widget isolation.
However, JavaScript that runs in an IFrame can still
open windows, communicate with servers, and perform
other operations that a Web sandbox disallows. Furthermore,
inter-frame communication is difficult when desired;
there are proposals to enhance IFrames to make
communication easier and more secure [20]. Language- 
based sandboxing is somewhat orthogonal in scope, is 
more flexible, and does not require changes to browsers. 


Runtime Security Analysis of JavaScript There are 
various means to secure widgets that do not employ
language-based security. Some systems rely on modified
browsers, additional client software, or proxy
servers [10, 11, 23-25, 32, 40]. Some of these propose
alternative Web programming APIs that are designed to be
secure. Language-based sandboxing has the advantage 
of working with today’s browsers and deployment meth- 
ods, but our verification ideas could potentially apply to 
the design of some of these systems, too. 


Acknowledgments 


We thank Douglas Crockford for discussions, open- 
mindedness, and insightful feedback (and the promise 
of certain crustaceans); Mark S. Miller for enlighten- 
ing discussions; Matthias Felleisen, Andrew Ferguson, 
and David Wagner for numerous helpful comments that 
helped us understand weaknesses in exposition; the NSF 
for financial support; and StackOverflow, as well as 
Claudiu Saftoiu (our lower-latency version of StackOver- 
flow), for unflagging attention to detail. 


References 


[1] D. Akhawe, A. Barth, P. E. Lam, J. Mitchell, and 
D. Song. Towards a Formal Foundation of Web 
Security. In IEEE Computer Security Foundations
Symposium, 2010. 


[2] C. Anderson, P. Giannini, and S. Drossopoulou. 
Towards type inference for JavaScript. In Eu- 
ropean Conference on Object-Oriented Program- 
ming, 2005. 


[3] J. P. Anderson. Computer Security Technology
Planning Study. Technical Report ESD-TR-73-51,
Deputy for Command and Management Systems,
HQ Electronic Systems Division (AFSC), L. G.
Hanscom Field, Bedford, Massachusetts 01730,
October 1972. 


[4] I. Awad, T. Close, A. Felt, C. Jackson,
B. Laurie, F. Lee, K.-P. Lee, D.-S. Hopwood,
J. Nagra, E. Sachs, M. Samuel, M. Stay,
and D. Wagner. Caja external security review.
Technical report, Google Inc., 2008.
http://google-caja.googlecode.com/files/Caja_External_Security_Review_v2.pdf.


[5] S. Bandhakavi, S. T. King, P. Madhusudan, and 
M. Winslett. VEX: Vetting browser extensions for 
security vulnerabilities. In USENIX Security Sym- 
posium, 2010. 


[6] A. Bohannon and B. C. Pierce. Featherweight Firefox:
Formalizing the Core of a Web Browser. In
Usenix Conference on Web Application Development
(WebApps), 2010.


[7] J. Cappos, A. Dadgar, J. Rasley, J. Samuel,
I. Beschastnikh, C. Barsan, A. Krishnamurthy, and
T. Anderson. Retaining Sandbox Containment Despite
Bugs in Privileged Memory-Safe Code. In
ACM Conference on Computer and Communications
Security (CCS), 2010.


[8] R. Chugh, J. A. Meister, R. Jhala, and S. Lerner.
Staged information flow for JavaScript. In ACM
SIGPLAN Conference on Programming Language
Design and Implementation, 2009.


[9] D. Crockford. ADSafe. www.adsafe.org, 2011.


[10] A. Dewald, T. Holz, and F. C. Freiling. ADSandbox:
Sandboxing JavaScript to fight Malicious Websites.
In Symposium On Applied Computing (SAC),
2010.


[11] M. Dhawan and V. Ganapathy. Analyzing information
flow in JavaScript-based browser extensions.
In Computer Security Applications Conference,
2009.


[12] U. Erlingsson. The Inlined Reference Monitor Approach
to Security Policy Enforcement. PhD thesis,
Cornell University, 2003.


[13] Facebook. FBJS, 2011.
http://developers.facebook.com/docs/fbjs/.


[14] M. Finifter, J. Weinberger, and A. Barth. Preventing
Capability Leaks in Secure JavaScript Subsets.
In Network and Distributed System Security Symposium,
2010.


[15] S. Guarnieri and B. Livshits. GATEKEEPER:
Mostly static enforcement of security and reliability
policies for JavaScript code. In USENIX Security
Symposium (SSYM), 2009.


[16] A. Guha, S. Krishnamurthi, and T. Jim. Static analysis
for Ajax intrusion detection. In International
World Wide Web Conference, 2009.


[17] A. Guha, C. Saftoiu, and S. Krishnamurthi. The
Essence of JavaScript. In European Conference on
Object-Oriented Programming, 2010.


[18] A. Guha, C. Saftoiu, and S. Krishnamurthi. Typing
Local Control and State Using Flow Analysis. In
European Symposium on Programming, 2011.


20th USENIX Security Symposium 185 


[19] P. Heidegger and P. Thiemann. Recency types
for dynamically-typed, object-based languages:
Strong updates for JavaScript. In ACM SIGPLAN
International Workshop on Foundations of
Object-Oriented Languages, 2009.


[20] C. Jackson and H. J. Wang. Subspace: Secure
Cross-Domain Communication for Web Mashups.
In International World Wide Web Conference,
2007.


[21] S. H. Jensen, A. Møller, and P. Thiemann. Type
analysis for JavaScript. In International Static
Analysis Symposium, 2009.


[22] S. H. Jensen, A. Møller, and P. Thiemann. Interprocedural
analysis with lazy propagation. In International
Static Analysis Symposium, 2010.


[23] T. Jim, N. Swamy, and M. Hicks. BEEP:
Browser-enforced embedded policies. In International
World Wide Web Conference, 2007.


[24] E. Kiciman and B. Livshits. AjaxScope: A platform
for remotely monitoring the client-side behavior of
web 2.0 applications. In Symposium on Operating
System Principles, 2007.


[25] M. T. Louw, K. T. Ganesh, and V. Venkatakrishnan.
AdJail: Practical enforcement of confidentiality
and integrity policies on Web advertisements. In
USENIX Security Symposium (SSYM), 2010.


[26] S. Maffeis, J. Mitchell, and A. Taly. An Operational
Semantics for JavaScript. In ASIAN Symposium on
Programming Languages and Systems, pages 307–325,
2008.


[27] S. Maffeis, J. C. Mitchell, and A. Taly. Isolating
JavaScript with Filters, Rewriting, and Wrappers.
In European Symposium on Research in Computer
Security (ESORICS), 2009.


[28] S. Maffeis, J. C. Mitchell, and A. Taly. Run-time
enforcement of secure javascript subsets. In
W2SP’09. IEEE, 2009.


[29] S. Maffeis, J. C. Mitchell, and A. Taly. Object Capabilities
and Isolation of Untrusted Web Applications.
In IEEE Symposium on Security and Privacy.
IEEE, 2010.


[30] S. Maffeis, J. C. Mitchell, and A. Taly. Object capabilities
and isolation of untrusted Web applications.
In IEEE Symposium on Security and Privacy, 2010.


[31] J. Magazinius, P. H. Phung, and D. Sands. Safe
Wrappers and Sane Policies for Self Protecting
JavaScript. In OWASP AppSec Research, 2010.


[32] L. Meyerovich and B. Livshits. Conscript: Specifying
and enforcing fine-grained security policies
for javascript in the browser. In IEEE Symposium
on Security and Privacy, 2010.


[33] M. S. Miller, M. Samuel, B. Laurie, I. Awad,
and M. Stay. Caja: Safe active content
in sanitized JavaScript. Technical report,
Google Inc., 2008.
http://google-caja.googlecode.com/files/caja-spec-2008-06-07.pdf.


[34] P. H. Phung, D. Sands, and A. Chudnov.
Lightweight self-protecting JavaScript. In ACM
Symposium on Information, Computer and Communications
Security, 2009.


[35] C. Reis, J. Dunagan, H. J. Wang, O. Dubrovsky, and
S. Esmeir. BrowserShield: Vulnerability-Driven
Filtering of Dynamic HTML. In Symposium on Operating
Systems Design and Implementation, 2006.


[36] D. Rémy. Programming objects with ML-ART, an
extension to ML with abstract and record types.
In M. Hagiya and J. Mitchell, editors, Theoretical
Aspects of Computer Software, volume 789 of
Springer Lecture Notes in Computer Science, pages
321–346. Springer Berlin / Heidelberg, 1994.


[37] F. B. Schneider, G. Morrisett, and R. Harper. A
Language-Based Approach to Security. In R. Wilhelm,
editor, Informatics, volume 2000 of Springer
Lecture Notes in Computer Science, pages 86–101.
Springer Berlin / Heidelberg, 2001.


[38] A. Taly, U. Erlingsson, M. S. Miller, J. C. Mitchell,
and J. Nagra. Automated analysis of security-critical
JavaScript APIs. In IEEE Symposium on
Security and Privacy, 2011.


[39] S. Tobin-Hochstadt and M. Felleisen. The Design
and Implementation of Typed Scheme. In
ACM SIGPLAN-SIGACT Symposium on Principles
of Programming Languages (POPL), pages 395–406,
2008.


[40] D. Yu, A. Chander, N. Islam, and I. Serikov.
Javascript instrumentation for browser security. In
ACM SIGPLAN-SIGACT Symposium on Principles
of Programming Languages, 2007.


[41] C. Yue and H. Wang. Characterizing Insecure
JavaScript Practices on the Web. In International
World Wide Web Conference, 2009.


USENIX Association 


Measuring Pay-per-Install: The Commoditization of Malware Distribution 


Juan Caballero†, Chris Grier*‡, Christian Kreibich*‡, Vern Paxson*‡


†IMDEA Software Institute    *UC Berkeley    ‡ICSI


juan.caballero@imdea.org    {grier, vern}@cs.berkeley.edu    christian@icir.org


Abstract 


Recent years have seen extensive diversification of the 
“underground economy” associated with malware and the 
subversion of Internet-connected systems. This trend to- 
wards specialization has compelling forces driving it: mis- 
creants readily apprehend that tackling the entire value-chain 
from malware creation to monetization in the presence of 
ever-evolving countermeasures poses a daunting task requir- 
ing highly developed skills and resources. As a result, 
entrepreneurial-minded miscreants have formed pay-per-install 
(PPI) services—specialized organizations that focus on the in- 
fection of victims’ systems. 


In this work we perform a measurement study of the PPI 
market by infiltrating four PPI services. We develop infrastruc- 
ture that enables us to interact with PPI services and gather and 
classify the resulting malware executables distributed by the 
services. Using our infrastructure, we harvested over a million 
client executables using vantage points spread across 15 coun- 
tries. We find that of the world’s top 20 most prevalent fami- 
lies of malware, 12 employ PPI services to buy infections. In 
addition we analyze the targeting of specific countries by PPI 
clients, the repacking of executables to evade detection, and the 
duration of malware distribution. 


1 Introduction 


Recent years have seen extensive diversification of the 
“underground economy” associated with malware and 
the subversion of Internet-connected systems. This trend 
towards specialization has compelling forces driving it: 
miscreants readily apprehend that tackling the entire 
value-chain from malware creation to monetization in 
the presence of ever-evolving countermeasures poses a 
daunting task requiring highly developed skills and re- 
sources. As a result, market forces foster a service cul- 
ture that has brought about a wide range of specialized 
providers for all stages in the malware-monetization
lifecycle, such as malware toolkits [3, 15], packing tools to
evade antivirus (AV) software [21], “bullet-proof” host- 
ing [4], and forums for buying and selling ill-gotten 
gains [10]. 


At the heart of this ecosystem lies the infection of vic- 
tim computers. Virtually every enterprise in this market 
ultimately hinges on access to compromised systems. To 
meet the demands for wholesale infection of Internet sys- 
tems, a service called pay-per-install (PPI) has risen to 
predominance. Such PPI services play a key role in the 
modern malware marketplace by providing a means for 
miscreants to outsource the global dissemination of their 
malware. Miscreants simply determine the raw number 
of victim systems (including specific geographical distri- 
bution, if desired) that fits within their budget, supply a 
PPI service with payment and malware executables of the 
miscreants’ choice, and in short order their malware is in- 
stalled on thousands of new systems. In today’s market, 
the entire process costs pennies per target host—cheap 
enough for botmasters to simply rebuild their ranks from 
scratch in the face of defenders launching extensive, en- 
ergetic, take-down efforts [6]. 


In this work we perform a measurement study of the 
PPI market by infiltrating four PPI services. We develop 
infrastructure that enables us to (1) interact with PPI ser- 
vices by mimicking the protocol interactions they ex- 
pect to receive from affiliates with whom they have con- 
tracted, and (2) gather and classify the resulting malware 
executables as distributed by the PPI services. We report 
results of infiltrations we conducted in the six months 
between August 2010 and February 2011. 


To our knowledge, our work reflects the first system- 
atic study of the PPI ecosystem as seen from the perspec- 
tive of the downloads pushed out by PPI services down 
to their victims. Security analysts have previously
examined PPI services in a top-down manner, by becoming
affiliates of particular services [7,29]. Our study is in- 
stead based on infiltrating PPI services in a bottom-up 
manner, by creating custom programs that can continu- 
ously download malware specimens that the PPI services 
distribute, enabling us to track the infiltrated PPI services 
over time. 


We harvested over a million client executables us- 
ing vantage points spread across 15 countries. The 
month of August 2010 yielded 57 malware families, in- 
cluding many of the most prevalent infections at the 
time. They include spam bots (Rustock, Grum), fake 
antivirus (Securitysuite, Securityessential), information- 
stealing trojans (Zbot, Spyeye), rootkits (Tdss), DDoS 
bots (Russkill, Canahom), clickers (Gleishug), and ad- 
ware (SmartAdsSolutions). 


Using our geo-diverse vantage points, we measure dif- 
ferences in the geographical preferences of the different 
malware families. We identify families that exclusively 
target the US, the UK, and a variety of European coun- 
tries. We also analyze the rate at which malware authors 
repack their wares to evade hash-based signatures. On 
average, they repack specimens every 11 days, and some
malware families repack up to twice daily. We track the 
dynamics of campaigns during which a service dissem- 
inates a given malware family in an ongoing push, ob- 
serving a wide temporal range, from specimens that are 
continually distributed over weeks, to pointwise efforts 
lasting only a few hours. We also analyze the particulars 
of how different PPI services interact with their affili- 
ates, including surprising evidence suggesting that some 
affiliates who sell installs to a particular PPI service not 
only buy installs from rival PPI services, but also from 
the very service to which they sell installs—apparently 
to exploit arbitrage. 


2 An Overview of Pay-Per-Install 


The PPI market, as depicted in Figure 1, consists of three 
main actors: clients, PPI providers (or services), and 
affiliates. We begin with an overview of these actors, 
followed by discussion of the transactions they perform 
(Section 2.1) and the means and importance of evading 
detection (Section 2.2). 


Clients are entities that want to install programs onto a
number of target hosts. They wish to buy installs of their
programs. The PPI provider receives money from clients
for the service of installing their programs onto the target
hosts, where installation comprises distributing the programs
to the target hosts, executing the client programs,
and tracking successful executions for accounting.


Figure 1: The typical transactions in the PPI market. PPI
clients provide software they want to have installed, and
pay a PPI service to distribute the software (1). The PPI
service conducts downloader infections itself or employs
affiliates that install the PPI’s downloader on victim
machines (2). The PPI service pushes out the client’s
executables (3). Affiliates receive commission for any
successful installations they facilitated (4).


The PPI provider develops a program, called a down- 
loader, that retrieves and runs clients’ executables upon
installation. The PPI provider may conduct the instal- 
lation of the downloader itself or may outsource distri- 
bution to third parties called affiliates. When a provider 
has affiliates, the provider acts as a middle man that sells 
installs to the clients while buying installs from affili- 
ates that specialize in some specific distribution method 
(e.g., bundling malware with a benign program and dis- 
tributing the bundle via file-sharing networks; drive-by- 
download exploits; or social engineering). PPI providers 
pay affiliates for each target host on which they execute 
the provider’s downloader program. Once the down- 
loader runs, it connects to the PPI provider to download 
the client programs. If the PPI provider does the distri- 
bution itself, we call the service a direct PPI service. If 
the PPI provider runs an affiliate program, we call it an 
affiliate PPI service. 


In general, both reputable and not-so-reputable enti- 
ties use PPI services. In this paper we focus on the use 
of PPI services as a distribution mechanism for malware, 
e.g., bots, trojans, fake AV software, and spyware. To
avoid determining what constitutes malware, we limit the
scope of the paper to PPI services that perform (or al- 
low their affiliates to perform) silent installs on the target 
hosts, i.e., installations that lack the informed consent of 
the owner of the system. Hereafter we use the term PPI 
providers to refer exclusively to those providers that per- 
form or facilitate silent installs. 


2.1 The PPI Ecosystem 


We describe the PPI ecosystem in terms of the transac- 
tions that take place between clients and PPI providers, 
and between PPI providers and their affiliates. 


Clients. Clients profit from the malicious activities en- 
abled by malware they want to deploy on target hosts, 
such as click fraud, stealing user information (e.g., credit 
card numbers, credentials), or selling software to the user 
under false pretense (e.g., fake AV). 


PPI providers allow clients to choose the geographic 
distribution of target hosts. This distinction creates price 
differentiation in the market due to varying demand for 
machines in certain regions and varying target host sup- 
ply. Clients pay only per unique install, i.e., for one in- 
stallation of their program on a given target host. 


PPI providers. PPI providers profit from installation 
fees paid by the clients. PPI install rates vary from 
$100–$180 for a thousand unique installs in the most
demanded regions (often the US and the UK, and more
recently other European nations), down to $7–$8 in the
least popular ones (predominantly Asia) [12, 13, 19]. In 
this study, we observe PPI providers installing multiple 
client programs on the same target host, and have not ob- 
served attempts to secure exclusive use of a target host 
on behalf of a client. Exclusivity of a host is difficult to 
guarantee because a PPI provider cannot generally know 
whether a target host already runs other malware (e.g., 
a rival PPI downloader that installs competitors of the 
client program). In addition, it is very difficult for clients 
to validate that the PPI service only installed their mal- 
ware on a host. 
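As a rough illustration of the price points quoted above, the following sketch computes what a client would pay for a given number of unique installs. Only the per-thousand rates come from the text; the function name and the order size of 5,000 installs are hypothetical.

```python
def campaign_cost(unique_installs, rate_per_thousand):
    """Cost in dollars of `unique_installs` installs at a per-thousand rate."""
    return unique_installs * rate_per_thousand / 1000

# Rates per thousand unique installs, as quoted in the text.
PREMIUM_RATES = (100, 180)   # most demanded regions (often the US and the UK)
BUDGET_RATES = (7, 8)        # least popular regions (predominantly Asia)

order = 5000  # hypothetical order size, for illustration only
premium = [campaign_cost(order, r) for r in PREMIUM_RATES]
budget = [campaign_cost(order, r) for r in BUDGET_RATES]
print(premium, budget)
```

At these rates the same order costs between $500 and $900 in the most demanded regions, but only $35 to $40 in the least popular ones, which corresponds to between under one cent and 18 cents per host.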


Affiliate PPI services give their affiliates a PPI down- 
loader program personalized with their unique affiliate 
identifier. The service credits affiliates for executing their 
specific PPI downloader on a target host. Affiliates only 
receive credit for confirmed installs of their PPI down- 
loader. The confirmation takes the form of the PPI down- 
loader sending the personalized affiliate identifier to the 
PPI provider after downloading and executing the client
programs. Thus, affiliates receive credit only after
delivering the installs.


Affiliates. Affiliates profit from the installs performed on 
behalf of the PPI provider, with the distribution method 
remaining transparent to the clients. Affiliates might in 
fact be botmasters that compromise hosts, install their 
own malware, and then task their malware with down- 
loading and installing the PPI downloaders as one means 
for monetizing their botnet. When doing so, the bot- 
master relinquishes exclusive control of the hosts in ex- 
change for the install payments from the PPI service. The 
same botmasters might work with multiple PPI providers 
simultaneously to maximize the income from each bot, 
installing multiple affiliate binaries on each of their hosts. 


Indeed, the market has a somewhat fundamental 
conflict-of-interest, in that the more installs a botmas- 
ter/affiliate provides, the more payment they receive; but 
each install degrades the quality of previous installs, be- 
cause the likelihood of the owner of the system discern- 
ing they have become infected, and remedying the situ- 
ation, rises with the volume of malicious installs on the 
system. 


2.2 Evading Detection 


AV software may detect and block any program in the 
installation chain, making it difficult to sustain installs. 
Therefore, providing stealthy executables is a key objec- 
tive for both PPI providers and clients. In the PPI ecosys- 
tem, clients are often in charge of making their programs 
stealthy before giving them to the PPI provider, while af- 
filiates rely upon the PPI provider to provide them with 
a stealthy downloader. 


To render programs stealthy, both PPI providers and 
clients employ packer programs sold by third parties [21, 
23]. Packers change the program content so that its sig- 
nature (e.g., MD5 hash) differs even though the pro- 
gram’s functionality has not changed. Sophisticated 
packers may also change the program size and add de- 
tection techniques for debuggers and virtual machines, 
which are commonly used by analysts. PPI providers 
have responsibility for packing the PPI downloaders for 
each affiliate and testing that the resulting executable 
remains undetected by AV software. In addition, PPI 
providers instruct affiliates and clients not to test their 
programs on free malware scanners [30, 32], because 
these services often redistribute samples to AV ven- 
dors. The vendors may then add new signatures to their 
databases, thus uncloaking the programs. We analyze
how frequently clients repack their programs in Section 4.2.


[Figure 2 graphic: timeline spanning 2005 to 2011 showing the brands iFrameCash, iFrameDollars, InstallsCash, GangstaBucks, and Earning4U, with the front-end domains iframecash.biz, iframemoney.biz, iframedollars.biz, iframedollars.com, buytraff.biz, installscash.org, gangstabucks.com, and earning4u.com.]


Figure 2: Brands used by the LoaderAdv PPI service
over time. The domains under each brand correspond
to known front-ends for affiliates.
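The effect of repacking on a hash-based signature can be seen with a minimal sketch. The byte strings below are stand-ins rather than real executables, and appending a byte is only a placeholder for what an actual packer does (typically compressing or encrypting the program and adding an unpacking stub).

```python
import hashlib

def md5_hex(data: bytes) -> str:
    """Hex MD5 digest, the kind of value a hash-based signature matches on."""
    return hashlib.md5(data).hexdigest()

# Stand-in bytes, not a real executable.
original = b"MZ" + b"\x90" * 64 + b"payload"
# Placeholder for repacking: any change to the content, however small.
repacked = original + b"\x00"

print(md5_hex(original))
print(md5_hex(repacked))
```

The two digests differ, so a signature keyed on the first hash no longer matches the repacked file. This is why the repacking rate matters to both clients and defenders.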


3 Infiltrating PPI Infrastructure 


In this section, we first describe how we identified the 
four PPI services we infiltrate, and evaluate our coverage 
of the PPI ecosystem. We then explain the processing 
pipeline we have developed for milking executables from 
PPI services and classifying them. 


3.1 Identifying PPI services 


A good starting point for identifying PPI services to in- 
filtrate is PPI forums [27,28], which mainly serve as a 
means for advertising affiliate PPI services to attract new 
affiliates. General underground forums sometimes offer 
the same advertisements. One challenge when study- 
ing PPI services concerns how to identify the different 
brands used by the same PPI service over time. We ap- 
proached this task by analyzing public information, in- 
cluding copies of any old front-ends [14], forums used 
to advertise affiliate PPI services [27,28], and previous 
analysis by security analysts [7,29]. 

We selected four affiliate PPI programs for infiltration: 
LoaderAdv, GoldInstall, Virut, and Zlob. We use these 
names to refer to the respective PPI services, regardless 
of their branded program names over time. Figure 2 il- 
lustrates such branding, employed by the LoaderAdv ser- 
vice. 


Our coverage. Several other PPI services exist that we 
did not infiltrate. To get an idea of our coverage of the 
malware ecosystem, we compare our malware harvest 
with contemporary reports by the security industry. In 
July 2010, FireEye posted the list of the top 20 malware 
families they observed using their network during April–June
2010 [22]. Table 1 correlates these 20 families with
the contents of our “milked” malware corpus for August
2010. The column labeled kit designates families
that are crimeware kits, software that one can purchase
and customize in order to build botnet variants. Each kit
sold may represent an individual botnet with a separate
owner. For popular kits such as Zbot, many distinct botnet
instances exist [33]. The column labeled seen indicates
whether we see samples of the family in our milking
data. We milk 12 of the top 20 families, remain unsure
about the phylogeny of 3, and miss 5 (AutoIt, Buzus,
Conficker, Koobface, Storm 2.0). We contacted FireEye
to inquire about the 3 unknown families, and based on
their response we believe they reflect generic tags used
by AV vendors, rather than specific families of malware.


 #   NAME          %      MONETIZATION         KIT   SEEN
 1   Palevo        7.50   DoS, Info stealer    ✓     ✓
 2   Hiloti        4.69   Downloader/PPI             ✓
 3   Zbot          3.62   Info stealer         ✓     ✓
 4   FakeRean      3.47   Rogue AV(s)                ✓
 5   Onlinegames   2.94   Info stealer               ?
 6   Rustock       2.66   Spam                       ✓
 7   Ldpinch       2.64   Info stealer         ✓     ?
 8   Renos         2.58   Rogue AV(s)                ?
 9   Zlob          2.54   Rogue software             ✓
10   AutoIt        2.53   Downloader/PPI
11   Conficker     2.48   Worm
12   Opachki       1.95   Click Fraud                ✓
13   Buzus         1.91   Info stealer
14   Koobface      1.17   Downloader
15   Alureon       1.16   Downloader           ✓     ✓
16   Bredolab      1.15   Downloader/PPI       ✓     ✓
17   Piptea        1.13   Downloader/PPI             ✓
18   Ertfor        0.91   Rogue AV(s)                ✓
19   Virut         0.91   Downloader/PPI             ✓
20   Storm 2.0     0.80   Spam


Table 1: FireEye’s top 20 malware families observed in
their MAX Cloud network in the April–June 2010 time
period [22] and whether we observe them in our milk for
August 2010.


3.2 “Milking” PPI Providers 


We now describe our milking operations. Figure 3 illustrates
the architecture of our system, from milking the executables
through their classification.


PPI “milker” requirements. Each PPI service uses at
least one downloader program. A PPI downloader has
three main tasks to perform: download the client programs,
execute them, and communicate successful installation
to the PPI service for accounting. For each
downloader used by a PPI service that we infiltrated, we
built our own program that mimics the network communication
used by the downloader to obtain the client
programs, but does not implement the rest of the downloader’s
functionality, namely executing the client programs
and accounting. In particular, we do our best to
identify and avoid any accounting communication to prevent
the PPI service from crediting an affiliate. We call
such programs milkers because we use them to milk the
client programs that the PPI provider distributes.


Figure 3: Architecture of our PPI milking system. The milkers contact the PPI services through Tor and store the
executables for processing (1). We then use Bro to distill network traffic summaries from packet traces recorded for
each sample’s contained execution (2). A behavioral classifier then processes these summaries and stores clustering
and tagging results to a database (3).
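The division of labor described above (download, but neither execute nor report back) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `milk` function, the injectable `fetch` callable, and the choice of SHA-256 file names are all hypothetical.

```python
import hashlib
from pathlib import Path

def milk(urls, fetch, store_dir):
    """Download each URL once and store the payload under its SHA-256 name.

    `fetch` is any callable mapping a URL to bytes, raising OSError on
    failure. Payloads are only written to disk: nothing is executed and
    no install confirmation is sent back to the service.
    """
    store = Path(store_dir)
    store.mkdir(parents=True, exist_ok=True)
    new_samples = []
    for url in urls:
        try:
            payload = fetch(url)
        except OSError:
            continue  # URL unreachable; move on to the next one
        digest = hashlib.sha256(payload).hexdigest()
        path = store / digest
        if not path.exists():  # skip samples already collected
            path.write_bytes(payload)
            new_samples.append(digest)
    return new_samples
```

Passing the transport in as `fetch` keeps the sketch independent of how the requests are routed, so the same loop could sit behind a proxy or be exercised offline with a stub.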


Although each PPI downloader program uses a differ- 
ent method to download the client programs from the 
PPI service, we observe two large classes. Basic PPI 
downloaders use plain HTTP and have a set of hard- 
coded URLs supplying client programs. The downloads 
remain unencrypted and could be spotted easily by any 
network monitoring device. The LoaderAdv and one of 
the GoldInstall downloaders (GoldInstall-dl) belong to
this class. Advanced PPI downloaders have a propri- 
etary, often encrypted, C&C protocol. These download- 
ers first contact the C&C infrastructure to receive the list 
of URLs supplying client programs. The Zlob, Virut, and 
an alternative GoldInstall downloader (GoldInstall-list) 
fall into this category. These downloaders still use HTTP 
for the downloads, at times encrypting the executables or 
disguising them as a benign file (e.g., by prefixing them 
with a fake GIF header). 
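To make the fake-GIF disguise concrete, here is a sketch of how one might recover a Windows executable hidden behind a GIF prefix. The exact layout used by the PPI services is not given in the text, so the assumption here is simply that an unmodified PE image follows some GIF-looking bytes. The check relies on the standard PE layout: an `MZ` DOS header whose field at offset 0x3C points to the `PE\0\0` signature.

```python
import struct

GIF_MAGICS = (b"GIF87a", b"GIF89a")

def find_embedded_pe(blob: bytes) -> int:
    """Return the offset of a plausible PE image inside `blob`, or -1."""
    start = 0
    while True:
        off = blob.find(b"MZ", start)
        if off < 0:
            return -1
        # e_lfanew: 4-byte little-endian offset to the PE signature,
        # stored at byte 0x3C of the DOS header.
        if off + 0x40 <= len(blob):
            (e_lfanew,) = struct.unpack_from("<I", blob, off + 0x3C)
            pe = off + e_lfanew
            if blob[pe:pe + 4] == b"PE\0\0":
                return off
        start = off + 1

def strip_fake_gif(blob: bytes) -> bytes:
    """If `blob` starts with a GIF magic but hides a PE image, return the PE part."""
    if blob.startswith(GIF_MAGICS):
        off = find_embedded_pe(blob)
        if off > 0:
            return blob[off:]
    return blob
```

Validating the `PE\0\0` signature rather than stopping at the first `MZ` avoids false matches on those two bytes occurring by chance inside the prefix. An encrypted payload would need its decryption routine applied first, which this sketch does not attempt.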


Building the milkers. Building a milker is most chal- 
lenging for downloaders using undocumented C&C pro- 
tocols and encryption routines. Our approach lever- 
ages previously proposed techniques for automatic bi- 
nary code reuse [5,16], which, given an executable, iden- 
tify and extract parts of the executable related to a given 
function or specific functionality defined by the analyst. 
Our milker building process is semi-automatic because 
we also manually decompile parts of the extracted binary 
code. The final milker uses a mixture of C source code
and assembly instructions. For this project, building and
testing a basic milker required on average one day of full 
work, while the advanced milkers required from two to 
five days of work. It is worth noting that while build- 
ing and testing the milker it is important to minimize the 
amount of traffic exchanged with the real C&C servers, 
which the PPI administrators may monitor. We learned 
this the hard way when the Zlob PPI service banned one 
of our computers during the testing phase. Moving to a 
different IP address fixed the issue. 


Updating the milkers. All PPI services frequently 
change their download URLs to bypass blacklists. When 
a PPI service changes its download URLs, our advanced 
milkers simply download the updated list from the PPI 
C&C infrastructure and keep milking. However, our ba- 
sic milkers, which have the old download URLs hard- 
coded, stop working until we update the URLs. To up- 
date the download URLs for the basic milkers, we first 
develop network signatures for the basic PPI download- 
ers. Then, we use two different approaches. First, we use 
the network signatures to look for new PPI downloaders 
within the executables we milk. If we find a match, our 
processing automatically extracts new URLs and adds 
them to our basic milkers. In addition, we also periodi- 
cally query search engines and repositories that perform 
malware analysis [30] for any new traffic that matches 
the network signatures. Due to the prevalence of the PPI 
services in this study, we often find the new URLs in 
public repositories immediately after URLs change. 
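The first update approach, matching a network signature against traffic from milked executables and harvesting any new download URLs, might look like the sketch below. The actual signatures are not given in the text, so the regular expression here (a request for `/load/<digits>.exe`) is purely hypothetical, as are the function name and the input format.

```python
import re

# Hypothetical signature: pretend a basic downloader always requests
# /load/<digits>.exe. The real signatures are not described in the text.
SIGNATURE = re.compile(r"^GET (/load/\d+\.exe) HTTP/1\.[01]$")

def extract_new_urls(requests, known_urls):
    """Return download URLs that match SIGNATURE and are not yet known.

    `requests` is an iterable of (host, request_line) pairs taken from
    traffic summaries of executed samples.
    """
    found = []
    for host, line in requests:
        m = SIGNATURE.match(line)
        if m:
            url = f"http://{host}{m.group(1)}"
            if url not in known_urls and url not in found:
                found.append(url)
    return found
```

The returned list would then be appended to the hard-coded URL set of the corresponding basic milker.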


Anonymity and geographical diversity. To provide 
anonymity and geographical diversity for the milkers, 
we route them, when possible, through Tor [31]. A 
milker achieves geographical diversity by using 15 Tor 
circuits in parallel, each circuit terminating in an exit 
node in a different country. We chose these countries
in accordance with different price points advertised by
PPI providers. We verify with the MaxMind GeoIP 
database [20] that the exit node’s IP address indeed re- 
sides in the desired country. For GoldInstall, Loader- 
Adv, and Virut, we conduct all network communication 
through Tor. We cannot access Zlob through Tor. We 
suspect the Zlob operators blacklist the Tor exit nodes, 
which are publicly known. To achieve geographical di- 
versity for this provider, we run its milkers on Ama- 
zon’s EC2 cloud [9] from hosts in two different coun- 
tries, without using Tor. We discuss the targets and re- 
sults of geographically diverse milking in Section 4.4. 
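One possible way to obtain exits in chosen countries is to run one Tor client per country, each with its own configuration. The text does not say how the circuits were set up, so the per-country layout, port numbers, and directory paths below are assumptions. `SocksPort`, `DataDirectory`, and `ExitNodes` are standard torrc options.

```python
def torrc_for_country(country_code, socks_port, data_dir):
    """Build torrc text for one Tor client that prefers exits in one country."""
    cc = country_code.lower()
    if len(cc) != 2 or not cc.isalpha():
        raise ValueError(f"expected a two-letter country code, got {country_code!r}")
    return "\n".join([
        f"SocksPort {socks_port}",
        f"DataDirectory {data_dir}",
        f"ExitNodes {{{cc}}}",  # country codes are written in braces, e.g. {us}
        "",
    ])

def build_configs(country_codes, base_port=9050):
    """One config per country, each listening on its own SOCKS port."""
    return {
        cc: torrc_for_country(cc, base_port + i, f"/tmp/tor-{cc.lower()}")
        for i, cc in enumerate(country_codes)
    }
```

Because exit selection depends on Tor's own geolocation data, an independent check of the exit node's address, as done in the text with the MaxMind database, remains a sensible extra step. That check is not shown here.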


3.3 Running the Executables


We run each new milked executable under containment 
in the GQ malware farm [18], a platform for hosting all 
manner of malware-driven research in safe, controlled 
fashion. GQ confines each piece of malware in its ex- 
ecution by a custom, manually created containment pol- 
icy that allows us to decide per-flow whether to allow 
traffic to interact with the outside, drop it, rewrite it, or 
reflect it to other machines inside the environment. In our 
scenario, the malware family and behavior are completely
unknown when we run a newly milked sample. Thus, we 
create a containment policy that allows us to run all of 
our samples safely, and to classify them based on their 
network traffic. 


We use this containment policy, called SinkAll, to au- 
tomatically run thousands of executables, fully unsuper- 
vised. This policy blocks network connections and redi- 
rects them to internal sink servers within the farm. The 
only traffic from the malware allowed on the Internet is 
DNS. The reason for allowing DNS is to try to get the 
malware sample to attempt C&C communication, since 
part of our classification process (Section 3.4) examines 
the traffic content. While our DNS sink server could 
simply reply to all DNS requests with a valid response 
that includes a fixed IP address, some malware sam- 
ples resolve benign domains (e.g., microsoft.com, 
google.com) and check the returned IP addresses 
against a hard-coded list in the malware. Thus, our DNS 
sink server proxies DNS requests and responses. If the 
DNS response is a failure, the sink server spoofs a suc- 
cessful DNS response with a fixed IP address to try to get 
the malware to attempt C&C communication. 
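The DNS sink behavior described above reduces to a small decision rule: pass real answers through, and replace failures with a spoofed success. The sketch below captures only that rule. The sink address, the function name, and the `upstream_lookup` interface are assumptions, and a real sink server would operate on DNS packets rather than Python lists.

```python
SINK_IP = "10.0.0.1"  # hypothetical address of an internal sink server

def sink_resolve(name, upstream_lookup, sink_ip=SINK_IP):
    """Proxy a DNS lookup; on failure, spoof a success pointing at the sink.

    `upstream_lookup` is any callable returning a list of IP strings, or
    raising LookupError or OSError when resolution fails.
    """
    try:
        answers = upstream_lookup(name)
    except (LookupError, OSError):
        answers = []
    if answers:
        return answers   # genuine answer, so hard-coded IP checks still pass
    return [sink_ip]     # failed lookup: answer with the sink instead
```

Returning genuine answers for resolvable names is what lets samples that compare the addresses of benign domains against a built-in list proceed, while dead C&C domains still lead the sample to a server that can record its traffic.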


SinkAll forwards all non-DNS TCP traffic from the 
malware to internal sink servers. For some well-known 
protocols, e.g., HTTP and SMTP, these servers mimic 
a valid session. This is important because some
malware samples will test connectivity first using these pro-
tocols, and a valid session may entice them to attempt 
C&C communication. All other TCP traffic goes to a 
generic sink server that accepts arbitrary connections but 
does not provide a response; it simply completes the TCP 
handshake and accepts any data sent by the malware. 
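

As a rough summary, the per-flow decisions of SinkAll can be written as a lookup function. The action labels below are hypothetical names chosen for this sketch, and the handling of non-DNS UDP traffic is an assumption, since the text only describes DNS and TCP.

```python
def sinkall_action(transport: str, app_protocol: str) -> str:
    """Return a label for how a SinkAll-like policy treats one flow."""
    if app_protocol == "dns":
        # DNS is the only traffic allowed out, via the proxying DNS sink.
        return "proxy-dns"
    if transport == "tcp" and app_protocol in ("http", "smtp"):
        # Well-known protocols go to sinks that mimic a valid session.
        return "protocol-sink"
    if transport == "tcp":
        # Everything else: complete the handshake, accept data, never reply.
        return "generic-sink"
    # Assumption: remaining traffic is simply not forwarded.
    return "drop"
```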


Finally, to detect anti-virtualization capabilities, sam- 
ples that do not send any traffic are rerun on a bare (non- 
virtualized) host, also within the farm. (This did not of- 
ten make a difference in practice.) 


3.4 Classifying the Executables 


We classify executables based on the network traffic they 
produce. First, we manually cluster them based on traf- 
fic similarity and create a cluster signature. Then, when 
possible, we tag clusters with names used by the com- 
munity such as Rustock or Palevo. 


Each run of a malware sample in the farm produces a 
trace of its network communication. We process the net- 
work trace with the Bro intrusion detection system [24], 
using a number of custom analysis scripts we developed. 
The scripts first check whether the sample generated any 
network traffic at all. If it did not, then we queue the 
executable for running on a bare host to check for anti- 
virtualization techniques. If the sample did generate traf- 
fic, we extract a number of features to characterize the 
network traffic that we later use during clustering. 


The first feature is the list of protocols used by the 
sample. To extract this feature, we leverage Bro’s 
dynamic protocol detection capabilities, which detect 
traffic for well-known protocols (e.g., DNS, HTTP, SMTP, 
and IRC), regardless of the port with which the commu- 
nication happens [8]. Another feature is the list of end- 
points that communicate with the sample. For this, we 
extract from the DNS traffic the domains requested by 
the sample. If the sample starts a connection without a 
previous DNS request, we also add the IP address it con- 
tacts to the list of end-points. Another feature is the list 
of TCP/UDP destination ports for connections started by 
the sample. Finally, we extract a content feature from the 
payload of any connection. For any HTTP request orig- 
inated by the malware, the content feature is the method 
and the list of parameters from the URL. We ignore the 
path in the URL and the parameter values because they 
tend to change often between samples. For other proto- 
cols, the content feature is simply the first 16 bytes sent 
by the malware. 
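

The content feature can be illustrated with a short sketch. It is not the Bro scripts used in this work; the function names are hypothetical, and the sketch assumes the request method and URL have already been parsed out of the trace.

```python
from typing import Tuple
from urllib.parse import parse_qsl, urlsplit


def http_content_feature(method: str, url: str) -> Tuple[str, Tuple[str, ...]]:
    """Method plus the ordered parameter names of the URL.

    The path and the parameter values are deliberately discarded, since
    they tend to change between samples.
    """
    query = urlsplit(url).query
    names = tuple(name for name, _ in parse_qsl(query, keep_blank_values=True))
    return method.upper(), names


def generic_content_feature(payload: bytes) -> bytes:
    """For non-HTTP protocols: the first 16 bytes sent by the sample."""
    return payload[:16]
```

Under this sketch, two requests such as `GET /a.php?ver=1&id=7` and `GET /b/c.php?ver=2&id=9` map to the same feature, ("GET", ("ver", "id")).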


We use the extracted features for clustering executables 
with similar network behaviors. In contrast to existing 
clustering systems for domain names [26], HTTP requests 
[25], and similar communication patterns [11], our system 
must accommodate any type of C&C, including custom binary 
protocols. In this work we therefore use our own simple 
clustering method, based primarily on manual inspection, 
but foresee integrating other approaches as the need 
arises. 


MILKER        DOWNLOADS   DISTINCT   START DATE 
LoaderAdv       696,714      4,334   Aug 1, 2010 
GoldInstall     361,325      4,488   Aug 1, 2010 
Virut             4,841         72   Aug 1, 2010 
Zlob                504        259   Jan 3, 2011 
Total         1,060,895      9,153 


Table 2: Number of downloads and distinct MD5s collected 
from each PPI service, starting August 1, 2010 and ending 
February 1, 2011. 


Our clustering first groups all executables with identi- 
cal features into a single cluster, with the list of features 
acting as the initial cluster signature. We then manually 
merge similar clusters, assigning the new cluster a signa- 
ture of simply the disjunction of the signatures of each 
merged cluster. Using this process on the August 2010 
milk, we identify 57 clusters. The cluster signatures vary 
from a domain list—of limited value due to continual up- 
dates to C&C domains—to binary and HTTP signatures 
that prove more useful long-term. 
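

The two clustering steps can be sketched as follows. This is an illustrative reading rather than the tooling used here: features are modeled as sets of strings, a signature as a set of alternative feature sets (a disjunction), and the choice of which clusters to merge remains a manual decision supplied by the analyst.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, Set, Tuple

Features = FrozenSet[str]
Signature = FrozenSet[Features]  # disjunction: any one alternative may match


def initial_clusters(samples: Dict[str, Features]) -> Dict[Signature, Set[str]]:
    """Group samples (keyed by MD5) whose feature sets are identical."""
    clusters: Dict[Signature, Set[str]] = defaultdict(set)
    for md5, features in samples.items():
        clusters[frozenset([features])].add(md5)
    return dict(clusters)


def merge(clusters: Dict[Signature, Set[str]],
          to_merge: Iterable[Signature]) -> Tuple[Signature, Set[str]]:
    """Merge analyst-chosen clusters; the new signature is the disjunction."""
    alternatives: Set[Features] = set()
    members: Set[str] = set()
    for signature in to_merge:
        alternatives |= signature
        members |= clusters[signature]
    return frozenset(alternatives), members


def matches(signature: Signature, features: Features) -> bool:
    """A sample matches if its features equal one of the alternatives."""
    return features in signature
```

Exact equality in `matches` is a simplification; signatures in practice may match on a subset of features, such as a URL pattern alone.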


For tagging, we prioritize clusters by the total number 
of times we milked them. For each cluster we manually 
check if we can find labeled traffic that matches the clus- 
ter signature in public repositories and malware analysis 
reports. If so, we change the cluster tag to match the pub- 
licly available name. This process is painful due to the 
disparity of names used for the same families (and bina- 
ries) in the community. We were able to tag 35 of the 
57 clusters. In Section 4.1 we describe the results from 
our classification. 


4 Insights into the PPI Business 


We now present results from our infiltration by analyzing 
the executables we collected. We began our milking 
operations on August 1, 2010. As of February 2011, we had 
downloaded 1,060,895 client executables, yielding 9,153 
distinct binaries during approximately 6 months of 
infiltration. The modest proportion (0.8%) of unique 
executables arises due to our frequent milking, and the 
fact that our geo-diverse milking frequently retrieves the 
same executable from multiple locations. We began our 
infiltration with LoaderAdv, GoldInstall, and Virut, 
adding Zlob in Jan. 2011. Table 2 shows the breakdown of 
our harvest by PPI service. The download rate varies 
across PPI providers since each PPI has a different number 
of endpoints to download malware and our milkers access 
each through geo-diverse locations. 


FAMILY              MILKED   DIST.   DAYS   CLASS        PPI 
Rustock             61,017      15     31   spam         L 
LoaderAdv-ack       60,770      62     31   ppi          L 
CLUSTER: A          11,758       8     31   clickfraud   G 
Hiloti              10,045      43     31   ppi          L 
CLUSTER: B           8,194       9     31   ?            G 
Gleishug             7,620      15     31   clickfraud   L 
Nuseek               5,802       2     30   clickfraud   G 
Palevo2             16,101      21     29   botnet       G,L 
Securitysuite       15,403     100     29   fakeav       L 
Zbot                 3,684      49     29   infosteal    G,L 
CLUSTER: D           5,723       1     28   ?            G 
SmartAdsSol.        18,317       6     26   adware       L 
Spyeye               4,522      16     25   infosteal    G,L 
Securitysuite-avm    4,732      45     20   fakeav       L 
Grum                 2,974      54     20   spam         G,L 
Tdss                 4,893      12     19   ppi          G,L 
Otlard                 677       7     16   botnet       G,L 
Blackenergy1         1,135      15     15   ddos         L 
Palevo               2,594       2     14   botnet       G 
Harebot              1,617      13     14   botnet       G,L,V 


Table 3: Top 20 malware families we milked during August 
2010. The columns indicate the total number of executables 
milked, distinct executables per family, the number of 
days seen, the families’ general class, and PPI services 
that distribute the family: LoaderAdv (L), GoldInstall 
(G), Virut (V). 


4.1 Family Classification 


We developed a set of classification signatures and vetted 
them based on extensive manual analysis of the 313,791 
executables we milked during August 2010. These signa- 
tures classify 92% of the total August downloads. If we 
then apply these same signatures to milk from September 
2010, the proportion matched only diminishes to 86%, 
and for October 2010, 77%. Thus, in terms of classi- 
fying the most prevalent downloads, the power of such 
milk-derived signatures decays fairly slowly with time. 
(Certainly we do expect their power to diminish, how- 
ever, as PPI providers acquire new clients, and existing 
clients release variants of their malware that no longer 
manifest the behavior targeted by our signatures.) 


For the 8% of August downloads unmatched by our 
signatures, we have assigned a general label reflecting 
absence of any generated traffic. We manually evaluated 
the behavior of 243 executables in this group and con- 
firmed that the executables appear corrupted and do not 
execute. We also ran most on bare hardware and con- 
firmed that their failure to execute does not reflect anti- 
virtualization checks. 


While our signatures work quite effectively for classi- 
fying the bulk of downloads, the picture changes if we in- 
stead consider distinct binaries (only 0.6% of the overall 
volume). For these, we classify only 36%. However, it 
is unclear that this latter figure holds much significance: 
a single malware specimen whose behavior we have not 
specifically classified can account for a large number of 
failures to classify distinct binaries if the specimen hap- 
pens to be repacked frequently. 


To examine the malware families distributed by each 
PPI provider, we limit our discussion to the August 2010 
milk. Since the distributed malware changes over time, 
focusing on a single month facilitates a clear presentation 
of our results, while still spanning a significant breadth 
of activity. Table 3 lists the top 20 malware families we 
milked during August 2010, the number of times milked, 
the number of distinct executables, the number of days 
we saw the family being dropped, the overall class for 
the family’s predominant activity (“botnet” represents 
generic malware platforms), and the different PPI ser- 
vices that distributed the family. 


Some of the malware families are crimeware kits 
(Palevo2, Spyeye, Zbot, Bredolab), which means they 
may be distributed by otherwise independent clients. 
When computing statistics for individual clients, we thus 
remove these kits to avoid potential aliasing. We observe 
that out of the 20 malware families, 7 are distributed by 
more than one PPI service. If we assume each (non-kit) 
malware family belongs to one actor, the results show 
that clients do not feel tied to a single PPI provider. 


Distribution over time. Figure 4 shows distribution 
timelines for each family we could label by activity 
class, for August 2010. We visualize availability 
continuously whenever a family was available at least once 
in three hours. We make several observations. Programs 
push clickbots at virtually all times, but DDoS platforms 
much more sporadically. The latter perhaps reflects some 
sort of Just-In-Time DDoS-for-hire service. With the 
exception of the GoldInstall-list downloader, we see PPI 
downloaders pushed for weeks at a time. Spambots show no 
uniform availability pattern: relatively short-lived 
push-outs for Pushdo and Grum, but continual push-outs for 
Rustock. In the PPI setting, botmasters can afford to push 
out their bots as convenient, which will keep the installs 
relatively “silent”; by contrast, propagation campaigns 
driven by social engineering (e.g., as used by Storm [17]) 
require more careful design and timing. 


                             DAYS TO REPACK 
FAMILY          # DISTINCT    MEAN     MIN     MAX 
Rustock                 15    2.12    0.00    8.51 
LoaderAdv-ack           62    2.21    0.00    7.14 
CLUSTER: A               8    7.46    2.63   12.34 
Hiloti                  43    0.76    0.00    2.58 
CLUSTER: B               9    4.42    0.34   23.62 
Gleishug                15    3.57    0.00    8.60 
Nuseek                   2   14.08    5.04   23.13 
Palevo2                 21    1.77    0.00   10.15 
Securitysuite          100    0.37    0.00    1.17 
CLUSTER: D               1   28.22   28.22   28.22 


Table 4: Repacking rates for the 10 most-milked families 
(Aug. 2010), excluding crimeware kits. The columns show 
the number of distinct binaries and the 
mean/minimum/maximum time to repack, in days. A minimum 
time of zero means that one of the distinct executables 
appeared in only a single milking instance. 
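

The three-hour rule used for the Figure 4 timelines can be sketched as interval merging over the times at which a family was milked. This is one plausible reading of the rule, namely that consecutive sightings at most three hours apart are drawn as one continuous span; the function name and signature are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple


def availability_intervals(
    sightings: Iterable[datetime],
    max_gap: timedelta = timedelta(hours=3),
) -> List[Tuple[datetime, datetime]]:
    """Collapse sighting times into continuous availability spans."""
    spans: List[Tuple[datetime, datetime]] = []
    for t in sorted(sightings):
        if spans and t - spans[-1][1] <= max_gap:
            # Close enough to the previous sighting: extend the span.
            spans[-1] = (spans[-1][0], t)
        else:
            # Gap too large (or first sighting): start a new span.
            spans.append((t, t))
    return spans
```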


4.2 Repacking Rate 


The rate at which malware distributors repack their prod- 
ucts reflects their concern about content-driven AV sig- 
natures. In this section we analyze the repacking rate 
for the client programs that we milk, which are typically 
repacked by the clients themselves. In addition, we 
describe how the Zlob service repacks its affiliate 
downloader binaries on-the-fly. 


In the milk from August 2010, a malware family is 
repacked on average at least once every 11 days. Ta- 
ble 4 summarizes the individual repacking rate for the 
top 10 families (excluding crimeware kits) milked in Au- 
gust 2010. The data for the top 10 families shows that 
they are repacked on average every 6.5 days. This 
indicates that the top malware families are repacked more 
often than the average malware family. Among these 
families, the most often repacked are Securitysuite (more 
than twice a day) and Hiloti (at least once per day). 
CLUSTER: D has the slowest repacking rate, with only 1 
executable seen during the month, followed by Nuseek (2 
executables). 
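

The 6.5-day figure can be reproduced by averaging the MEAN column of Table 4, taking the CLUSTER: A entry as 7.46 days. The short check below is only an arithmetic sketch; the variable names are ours.

```python
# Mean days to repack per family, copied from the MEAN column of Table 4.
mean_days = {
    "Rustock": 2.12,
    "LoaderAdv-ack": 2.21,
    "CLUSTER: A": 7.46,
    "Hiloti": 0.76,
    "CLUSTER: B": 4.42,
    "Gleishug": 3.57,
    "Nuseek": 14.08,
    "Palevo2": 1.77,
    "Securitysuite": 0.37,
    "CLUSTER: D": 28.22,
}

# Unweighted average over the ten families.
overall_mean = sum(mean_days.values()) / len(mean_days)

# Repacks per day is the reciprocal of the mean days between repacks.
repacks_per_day = {family: 1.0 / days for family, days in mean_days.items()}

print(round(overall_mean, 1))                      # 6.5
print(round(repacks_per_day["Securitysuite"], 1))  # 2.7
print(round(repacks_per_day["Hiloti"], 1))         # 1.3
```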


In Figure 5, we contrast the repacking of the Rustock and 
Securitysuite families (with two variants of the latter) 
over the course of August. We plot distinct variants on 
the y-axis, with entries ordered by first appearance. 
Rustock changes executables less frequently, and with 
little version overlap. Furthermore, the program dropped 
Rustock during the whole month, while Securitysuite has 
more complex availability: Securitysuite-avm, a 
Securitysuite subfamily with anti-VM capabilities (for 
VMware, specifically), filled the availability gaps when 
Securitysuite was not pushed out.¹ In aggregate, 
Securitysuite was thus likewise available throughout, 
though with differing anti-VM capabilities. One possible 
explanation is that the Securitysuite gang uses two 
off-the-shelf packers, but only one provides anti-VM 
capabilities. 


Figure 4: Malware family availability via infiltrated PPI 
services in August 2010. We only show families with a 
known activity class. 


Figure 5: Repacking activity according to binary changes 
over time for the Securitysuite and Rustock families. Some 
Securitysuite binaries detected virtualized execution; we 
separate these by color. 


Zlob affiliate downloader repacking. Unlike for the 
malware that their clients provide, PPI providers typi- 
cally repack affiliate downloader binaries on a periodic 
basis and notify their affiliates to switch to the fresh 
downloader [29]. We found that the Zlob service has in- 
corporated a twist on this approach. They provide a web 
service for affiliates to request a fresh binary, which, in- 
terestingly, apparently repacks the affiliate binaries on- 
the-fly. We requested the downloader for a single affili- 
ate 27 consecutive times, resulting in 27 distinct, work- 
ing Zlob binaries with identical sizes but differing MD5 
hashes. Attackers could likewise apply such on-the-fly 
packing to other areas, such as drive-by-downloads, to 
create unique malware for each compromised host. 
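

A check of this kind is easy to script. The sketch below assumes the downloaded binaries are already in memory as byte strings; the helper names are hypothetical, and the code performs no network requests.

```python
import hashlib
from typing import Iterable, Set


def distinct_md5s(binaries: Iterable[bytes]) -> Set[str]:
    """Set of MD5 hex digests over the given binaries."""
    return {hashlib.md5(blob).hexdigest() for blob in binaries}


def all_same_size(binaries: Iterable[bytes]) -> bool:
    """True if every binary has the same length (or there are none)."""
    return len({len(blob) for blob in binaries}) <= 1
```

For the 27 requests described above, the observation corresponds to `len(distinct_md5s(blobs)) == 27` with `all_same_size(blobs)` true.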


4.3 PPI Behavior 


In this section we look at the behavior and distinct struc- 
ture manifested by each PPI provider for managing their 
downloads. 


LoaderAdv. The LoaderAdv downloader hard-codes two 
domains and a set of file paths that it combines with the 
two domains to create the URLs to locate 
the malware executables. If we ignore the domain part 
of the URL (the second domain is only used for redun- 
dancy) we observe two classes of URLs: single-client 
and multi-client. Single-client URLs always return the 
same family of malware, while multi-client URLs cycle 
through a set of clients that changed over the course of 
our infiltration. These latter also yielded different down- 
loads based on the geo-location of the milker’s IP ad- 
dress, an aspect we examine further in Section 4.4. 


Figure 6 shows the behavior of a single multi-client URL 
as seen by our milkers. We show the different families in 
separate boxes, and the y-axis represents the countries 
involved. (The gaps on August 5 and 11 arise due to 
failures of the milkers to connect through Tor.) As we 
milk binaries from this URL, we typically see 
Securitysuite or SmartAdsSolutions binaries. We also 
obtain Zbot for a brief 11-hour period and 
GoldInstall-list for about three days. During August 2010 
our LoaderAdv milker downloads malware from a total of 19 
unique URLs (ignoring the domains). Three of these are 
single-client URLs only serving Rustock, while the 
remaining 16 drop malware matching 31 of our signatures. 


¹ Detecting the presence of Securitysuite-avm versus 
Securitysuite was the only significant identification we 
obtained by using our “bare metal” setup in addition to 
our VM-based execution environment. 


GoldInstall. GoldInstall has two downloaders. The 
GoldInstall-list downloader contacts the PPI C&C server 
to obtain a list of URLs hosting the client executables. 
The received list varies based on the geographic 
location. GoldInstall-dl has a hard-coded list of URLs in the 
binary that serve executables independent of geographic 
location. Both the GoldInstall-list and GoldInstall-dl 
downloaders fetch the executables using HTTP, with 
each distinct URL representing a single family of mal- 
ware. Often, the service hosts the same client executable 
in multiple locations, with the path components of the 
URL (such as 1.exe) remaining constant. When the 
path is the same, typically so is the family of malware, 
though we also observed common URL paths used for 
multiple families (e.g., bot.exe). The download 
locations show no evidence of checking the geo-location 
of the downloader before serving malware. Thus, the 
GoldInstall-dl downloader does not download executa- 
bles based on geographic location. Throughout the 
month, the program periodically distributed new URLs 
to the PPI executable, 41 total. These on average con- 
tinued to return valid executables for 36 days after first 
provided by the C&C (maximum 162 days, minimum 14 
hours). 


Virut. The Virut downloader uses a custom IRC-based 
C&C protocol to receive a list of URLs hosting the client 
executables. We observe a total of six distinct URLs 
throughout August 2010, distributing 15 distinct executa- 
bles matching signatures for three families. Four of the 
URLs use a domain with the same whois entries as the 
Virut C&C, and each URL can return a different exe- 
cutable for each request. 


Zlob. The Zlob downloader uses a custom encrypted 
C&C protocol to request a list of URLs to locate client 
programs. The received list varies based on the geographic 
location. The service replicates the list of URLs so that 
every two received URLs correspond to one executable, at 
two locations apparently for redundancy. 


Figure 6: Availability of malware families over time, from 
a single LoaderAdv URL. The empty family shows when the 
URL provided a non-executable response. 


4.4 Geographic Breakdown 


To investigate the geographical preferences of the dif- 
ferent malware families, we analyze the milk from the 
LoaderAdv, GoldInstall, and Virut services, since as ex- 
plained in Section 3.2 for these three services the milker 
used 15 Tor circuits in parallel, each terminating in a 
different country. We selected 15 countries using price 
points advertised by PPI providers: AT, BR, DE, ES, FR, 
GB, GR, IT, JP, KR, NL, PL, PT, RU, and US. 


For most malware families we observe clear geograph- 
ical preferences. Figure 7 shows the frequencies with 
which we obtained a sample of the Ertfor, Gleishug, 
Rustock, Securitysuite, and SmartAdsSolutions families, 
each of which our milkers downloaded at least 100 times 
during August. We selected these groups to highlight 
characteristics we observe in geographical distribution; 
other families exhibit similar patterns. 


Three trends in geographical distribution emerge. First, 
we commonly see families of malware preferentially 
targeting Europe and the US (e.g., Ertfor, Securitysuite, 
and SmartAdsSolutions). Second, some families exclusively 
target the US or another single country (e.g., Gleishug). 
Finally, we observe families with no geographical 
preferences (e.g., Rustock). 


Several factors can influence a PPI client’s choice of 
country. First, the class of activity in which the client’s 
executable engages. A spam bot such as Rustock requires 
little more than a unique IP address to send spam, while 
fake AV such as Securitysuite often targets speakers of a 
specific language, and may need to support user payment 
methods specific to some areas. In addition, the install 
rate a client pays also varies depending on the targets’ 
countries. We find the US and Great Britain generally at 
the high end ($100-180 per thousand), other European 
countries in the middle ($20-160), and the rest of the 
world at the bottom (< $10) [12, 13, 19]. 


4.5 Affiliate-PPI Interactions 


Surprisingly, among the binaries that we milk we find a 
number of affiliate PPI downloaders. That is, downloaders 
not infrequently download other downloaders. This 
indicates that some PPI affiliates have also signed up as 
clients of PPI services. To understand these affiliate-PPI 
interactions, we extracted the unique affiliate identifier 
embedded in each of the PPI downloaders found in our milk, 
which we can observe from its transmission (presumably for 
accounting purposes) during the C&C exchange. 


Figure 7: Prevalence of six malware families seen by our 
milkers from different country vantage points. 


Using these identifiers, we observe that affiliates from 
one PPI service themselves sometimes act as clients 
of other PPI services. This behavior manifests by our 
milker, impersonating affiliate X for PPI service A, fetch- 
ing an executable for installation that corresponds to a 
downloader for affiliate Y of PPI service B. 


We speculate that some of these multi-PPI-service af- 
filiates represent arbitrageurs who try to take advantage 
of pricing differentials between the (higher) install rates 
paid to the affiliates of one service for some geographical 
regions versus the (lower) install rates charged to clients 
of another PPI service. For example, we observe that 
LoaderAdv’s affiliate 701 signed up as a client of 
GoldInstall, using the latter to distribute 701’s personalized 
LoaderAdv downloader for four days. Here, the price 
differential includes the US, Canada, and Europe, from 
which our GoldInstall milkers collected this executable. 


Perhaps even more surprising, we find affiliates from 
one PPI service who are also clients of the same PPI ser- 
vice. For example, LoaderAdv’s affiliate 515 distributed 
their personalized LoaderAdv downloader over Europe 
and Brazil using the LoaderAdv service for a total of 
20 hours. We see a similar behavior from affiliates 0625 
and gol of the GoldInstall service, both clients and af- 
filiates of GoldInstall. We conjecture that this happens 
when affiliates try to take advantage of the price differ- 
ential between the (higher) install rates paid to the af- 
filiates for some geographical regions over the (lower) 
install rates paid by the clients for installing on the same 
regions. Note that such a price differential is possible 
because the PPI service oversells installs: multiple 
clients can pay the service for installs that cost the 
service only a single affiliate payout. We suspect the PPI 
service can detect this behavior and would not credit both 
affiliates for the install. 


In a yet more convoluted case, we observed a GoldInstall 
affiliate, e4u, signing up as a client for both 
GoldInstall and LoaderAdv. We speculate that e4u most 
likely stands for “earning4u”, the brand for the LoaderAdv 
PPI service at that time. (Presumably this affiliate 
simply took advantage of price differentials within the 
GoldInstall service and with the LoaderAdv service, but 
possibly e4u in fact represents the LoaderAdv gang 
itself.) 


4.6 The Download Tree 


One important observation of our work regards how 
the nesting of downloaders-downloading-additional- 
downloaders can quickly grow strikingly complex. To 
capture such nesting we use a download tree. Nodes in 
the tree represent programs identified by hashes of their 
binary. At each branch in the tree, children represent 
programs installed by the parent. Figure 8 shows an ex- 
ample download tree. We term any node with children 
a downloader. Nodes with a single child may be spe- 
cialized downloaders for the child family, while nodes 
with multiple children may reflect PPI downloaders that 
charge the children for the installs. Leaf programs may 
implement any of a number of recognizable malware be- 
haviors, including sending spam, performing click-fraud, 
and stealing personal information. 
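

A download tree of this kind can be represented with a small data structure. The sketch below is an illustration only: the class and method names are hypothetical, and it assumes the analysis environment can already report which binary installed which other binary.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    binary_hash: str
    family: str = "unknown"
    children: List["Node"] = field(default_factory=list)

    @property
    def is_downloader(self) -> bool:
        # Any node with children counts as a downloader.
        return bool(self.children)


class DownloadTree:
    def __init__(self, root_hash: str, family: str = "unknown") -> None:
        self.root = Node(root_hash, family)
        self._nodes: Dict[str, Node] = {root_hash: self.root}

    def record_install(self, parent_hash: str, child_hash: str,
                       family: str = "unknown") -> Node:
        """Record that the parent binary installed the child binary."""
        parent = self._nodes[parent_hash]
        child = self._nodes.get(child_hash)
        if child is None:
            child = Node(child_hash, family)
            self._nodes[child_hash] = child
            parent.children.append(child)
        # Simplification: a binary already in the tree keeps its first parent.
        return child

    def downloaders(self) -> List[Node]:
        return [n for n in self._nodes.values() if n.is_downloader]

    def leaves(self) -> List[Node]:
        return [n for n in self._nodes.values() if not n.is_downloader]
```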


Generating the download tree requires carefully iden- 
tifying the dependencies between installed programs, 
e.g., which program downloads and executes other pro- 
grams. To build the tree in Figure 8, the client mal- 
ware programs need the freedom to download other exe- 
cutables from the Internet. For this experiment we used 
a different containment policy that sinks everything but 
HTTP and C&C. In addition, we rate-limited the outgoing 
HTTP and C&C traffic, and a human operator monitored the 
execution in real-time to stop the process if anything 
unexpected happened. 


Figure 8: A download tree starting with a single LoaderAdv 
downloader. Stripes indicate PPI service-related binaries. 


The download tree in Figure 8 comes from the live 
execution of (originally) a single LoaderAdv PPI down- 
loader that we ran in our controlled environment. Strik- 
ingly, the entire execution required under 10 minutes— 
with several additional leaf nodes omitted for clarity! 
Thus, the example illustrates how quickly an exploited 
system can transform from unmolested operation to host- 
ing a veritable ecosystem of malware. 


5 Discussion 


Our findings have a number of implications, as follows. 


Malware classification. Our work shows that we should 
conceptually separate the exploitation mechanism com- 
promising a system from the malware that the system 
subsequently hosts. For example, it may not make sense 
to characterize malware by its infection method beyond 
malware that self-propagates and malware that does not. 
Botmasters might simply purchase installation of their 
malware from PPI services which can use a variety of 
distribution methods. 


The installation of malware from multiple clients 
on a single target host has important implications for 
behavior-based malware classification. For example, 
when writing a malware analysis report it is easy to con- 
fuse a downloader with malware that it happens to install 
during one particular execution. Such confusion can then 
result in misleading statistics characterizing the preva- 
lence of malware families. Furthermore, malware anal- 
ysis platforms that execute malware with Internet con- 
nectivity [1,2,30] should carefully track program down- 
loads and their execution, to allow separation of each 
program’s runtime behavior. Without a download tree, 
behavioral reports may reflect the aggregate behavior of 
multiple types of malware. These aggregate reports may 
result in incorrect classifications, and in the worst case 
the produced signature may fail to detect individually ex- 
ecuting malware. 


Regarding classification techniques, we note that our 
work does not aim to pursue advances in the field of be- 
havioral malware signature generation, and instead em- 
ploys straightforward techniques. We could fruitfully in- 
corporate much of the published research in this space 
into our classification approach. 


Defenses. As defenders, we need to understand and ap- 
preciate the threat posed by the “silent installs” industry. 
PPI services have direct implications for takedown ef- 
forts: even if defenders can completely clean up a botnet 
(as opposed to merely severing its C&C master servers), 
the botmaster could return to business-as-usual through 
modest payments to one or more PPI services. Given that 
multiple malware authors share use of the same PPI ser- 
vices, and that the number of PPI services seems to be 
significantly smaller than the number of malware fam- 
ilies, PPI services are good targets for future takedown 
efforts. However, the commoditization of the malware 
industry could make it easy to recreate PPI services else- 
where after takedown, so the focus should be on identify- 
ing and apprehending the people that run such services. 


Regarding detection techniques, we observe that the 
content-based features of our signatures perform better 
than the endpoint-based features. Content-based features 
cope better with the periodic replacement of stale URLs 
that PPI services employ for hosting the malware 
executables, likely to bypass URL blacklists. We 
also observe that many downloaders employ a simple 
download-and-execute strategy, which in turn suggests 
that defenders might realize significant protections by 
employing taint-based approaches that identify the 
execution of downloaded data. 


Evasion. Infiltrating the PPI C&C protocols required 
significant reverse-engineering effort on our part. As 
miscreants become aware of this possibility and more 
parties launch infiltration attempts, adversarial evolu- 
tion will surely complicate this process. In particular, 
we expect PPI services to harden their C&C protocols 
with more robust use of cryptographic techniques and 
incorporation of anti-virtualization and triggering mech- 
anisms to increasingly hamper dynamic analysis. On the 
other hand, the fact that a relatively modest infiltration 
effort sufficed to gain insight into many of today’s top 
malware families is encouraging. Analysts should re- 
main on the lookout for opportunities to infiltrate core 
components of the modern malware ecosystem, which 
may offer broad insights into the malware landscape. 


6 Conclusion 


We have presented the results of the first systematic study 
of the pay-per-install (PPI) ecosystem, conducted by in- 
filtrating the malware distribution mechanism of PPI ser- 
vices. The ability to “milk” malware binaries directly 
from the source provides an unprecedented intelligence 
capability to defenders. We leveraged this approach 
to measure technical aspects of the market surrounding 
malware installation. 


Starting with a network-behavioral classification of a 
one-month corpus of 313,791 binaries, we identified 12 
of the 20 most prevalent families of malware. We 
illustrated how several clickfraud and fake-AV families 
specifically target the United States and Europe, 
while other malware classes, such as spam bots, are dis- 
tributed worldwide. Our examination of repacking rates 
of PPI-distributed malware showed that on average bina- 
ries are repacked every 11 days, with one family of mal- 
ware repacking up to twice a day. Finally, we illuminated 
the relationships among actors in the PPI ecosystem, in- 
cluding the identification of LoaderAdv and GoldInstall 
affiliates that apparently engage in pricing arbitrage by 
becoming clients to other PPI providers. 


A Examples of Signatures 


This appendix provides a concrete view of several of the 
malware signatures that appear in Table 3. We include 
four popular-but-untagged clusters, and two versions of 
Palevo for reference. 


Each signature consists of three parts: URL, 
DOMAIN, and PAYLOAD statements followed by 
associated contents. The URL can contain regular 
expressions; we only use it for HTTP-based protocols. 
DOMAINS can list IP addresses or domains (with or 
without subdomains). PAYLOAD statements specify the 
parameters for where, type, contents, and len. 
where specifies the location to match, with “begin” 
meaning at the beginning of the payload. We use type 
to inform the engine whether to interpret the contents as 
a string or as an array of bytes. Finally, len restricts the 
length of the checked packet: a signature that specifies a 
len will only match if the packet has exactly the given 
length in bytes. 
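

To make the PAYLOAD semantics concrete, the following sketch matches a packet against one such statement. It is not the matching engine used in this work: the class and function names are hypothetical, only where = begin is handled, and contents is assumed to have already been converted to bytes according to type.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PayloadRule:
    where: str                    # only "begin" is handled in this sketch
    contents: bytes               # bytes expected at that location
    length: Optional[int] = None  # exact packet length, if specified


def payload_matches(rule: PayloadRule, packet: bytes) -> bool:
    """True if the packet satisfies the rule."""
    if rule.length is not None and len(packet) != rule.length:
        # A rule with len matches only packets of exactly that length.
        return False
    if rule.where == "begin":
        return packet.startswith(rule.contents)
    raise ValueError(f"unsupported location: {rule.where!r}")
```

For example, a rule with where = begin, contents 0x61, and len 7 would be written as `PayloadRule("begin", bytes([0x61]), 7)`.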





CLUSTER: A 


URLS 
/svc.php\?ver= 


DOMAINS 
sy.perfectexe.com 
sy2.perfectexe.com 
sy3.perfectexe.com 





CLUSTER: B 


URLS 
/get.cgi\?.+ 
/data.cgi 


DOMAINS 
f19dd4abb8b8bdf2.cn 
2bff2694930d2e21.cn 
697fe322c995da1a.net 
89e3aaecc2ba1734.net 
ade34ea82c4f7f2f.net 





CLUSTER: C 


DOMAINS 
ds.perfectexe.com 


URLS 
/active.asp\?[0-9]{2}





CLUSTER: D (URLs truncated for space)


DOMAINS
x.liruna.com


URLS
/x.ashx\?
ashx\?a=get&v=
ashx\?a=[^&]+v=[^&]+&fid=[^&]+&id=...





Palevo


DOMAINS
193.104.186.88
76.76.99.186
ff5v9w.com
e7j0h7.cn
mplr3n.ru


URLS
/hygtrve.exe
/htrgef.exe
/ntgref.exe
/hybtvr.exe


PAYLOAD
where    : begin,
type     : bytes,
contents : [[0x61]],
len      : 7





Palevo2


DOMAINS
ff.fjpark.com
fifa2012terra.com
converter50.com


URLS
/vcip.exe, /usa.exe, /575.exe,
/adv.exe, /adv2.exe, /rip2.exe,
/prr.exe, /4757exe.exe


PAYLOAD
where    : begin,
type     : bytes,
contents : [[0x18]],
len      : 21


Abstract


Modern Web services inevitably engender abuse, as at- 
tackers find ways to exploit a service and its user base. 
However, while defending against such abuse is gener- 
ally considered a technical endeavor, we argue that there 
is an increasing role played by human labor markets. Using
over seven years of data from the popular crowdsourcing
site Freelancer.com, as well as data from our own
active job solicitations, we characterize the labor market 
involved in service abuse. We identify the largest classes 
of abuse work, including account creation, social net- 
working link generation and search engine optimization 
support, and characterize how pricing and demand have 
evolved in supporting this activity. 


1 Introduction 


Today’s online Web services—search engines, social net- 
works, and the like—create value for their users by help- 
ing them find and interact with content generated by 
other users. While these services typically rely on adver- 
tising for their revenue, their open access and reliance on 
user-generated content create powerful opportunities for 
abusers to fabricate secondary, extremely cheap advertis- 
ing channels as well. The result is well-known: Web-mail 
spam, polluted search results, “friend” requests from fake 
persons and so on. These activities are broadly termed 
service abuse: they exploit some feature of a public ser- 
vice for an attacker’s financial gain at the expense of the 
service provider. 

Each Web service provider aims to prevent such activi- 
ties and preserve the value of their advertising enterprise. 
To that end, most Web sites include extensive contracts 
declaring limits on the way their services may be used. 
However, implicit threats of legal action rarely deter at- 
tackers, and so the provider must rely on a broad range 
of defenses and countermeasures to enforce their terms 
of service. While the technical details of this “arms race” 
are themselves interesting, they are ultimately just symp- 
toms of this larger struggle over controlling who may 
monetize access to a site’s users. 

Thus, in this paper we do not focus deeply on the un- 
derlying technical attacks themselves, but rather explore 
the human labor markets in which these capabilities are 
provided. Though not widely appreciated, today there
are vibrant markets for such abuse-oriented services and,
in a matter of minutes, one can buy a thousand phone-verified
Gmail accounts for $300 or a thousand Facebook
“friends” for $26. Much of this activity occurs on freelance
work sites in which buyers “crowdsource” work
by posting jobs they need done, and globally distributed
workers bid on projects they are willing to take on.¹

There are multiple advantages in this approach. First, 
many anti-abuse countermeasures are designed to de- 
tect or deter mechanistic automation and can be by- 
passed through the use of low-cost human labor. Perhaps 
the best known example of this phenomenon is found 
in CAPTCHAs, human-solvable puzzles designed to be 
challenging for automated solvers. While these puzzles 
are specifically designed to prevent computer-based ser- 
vice abuse, we have previously documented how a robust 
CAPTCHA-solving market has emerged by aggregating
large amounts of cheap human labor instead [12].

A second advantage is that the crowdsourcing medium 
allows innovative attackers to quickly explore differ- 
ent schemes for evading anti-abuse defenses (due to the 
agility of a large contract labor pool). Finally, once a 
new attack scheme becomes sufficiently popular to com- 
moditize, competitive pressures naturally drive workers 
to develop the most efficient means of satisfying the 
demand. Indeed, eventually the most popular activities 
(e.g., CAPTCHA-solving or phone verified accounts) 
can support their own branded retail services outside the 
scrum of the spot labor market. 

In this paper, we characterize the abuse-related labor 
on Freelancer.com, one of the largest and most popu- 
lar freelancer sites. Using almost seven years of historic 
data, and a range of our own contemporary work solici- 
tations, we examine four classes of jobs: 


• Account registration and verification,
• SEO content and link generation,
• Ad posting and bulk mailing,
• Social network linking.


Each of these represents a kind of service abuse, in- 
corporating manual labor to bypass existing controls, 
and each is ultimately a building block in some larger, 
economically-driven, advertising enterprise. 


¹ To be clear, while the majority of such work is legitimate
(anything from corporate logo design to software development), a large
minority serves the online service abuse ecosystem.

The rest of this paper is organized as follows. Section 2
describes crowdsourcing and the Freelancer.com service 
in particular. In Section 3 we explain our methodology, 
the data we have gathered and the different categories 
of jobs in our study. Section 4 explains, as case studies, 
several components of the abuse value chain that have 
become semi-commoditized, followed by a characteri- 
zation of the Freelancer labor market in Section 5. Sec- 
tion 6 places these abuse activities into a larger, interre- 
lated context and Section 7 summarizes our findings. 


2 Background 


Outsourcing has long been a cost-cutting strategy in 
developed economies—pushing out key business pro- 
cesses to exploit the efficiencies or lower labor costs of 
third-party service providers. A more recent innovation 
is “crowdsourcing”, further unbundling labor from any 
structured organization and leveraging the broad connec- 
tivity provided by the Internet. In this model, individuals 
participate in the labor force as free agents, responding to 
open calls for work on a piecework basis. In many cases, 
crowdsourcing is built on free labor (e.g., for many con- 
tributors to open-source projects, or in von Ahn’s sem- 
inal ESP game [3]). However, fee-based crowdsourcing 
sites quickly emerged, the most famous being Amazon’s 
Mechanical Turk service. Using such services, employ- 
ers post requests for service at a particular price, while 
laborers in turn can “solve” the subset of requests that 
appeal to them. 

However, crowdsourcing also presents a number of 
concerns. First, as an employment vehicle, crowdsourc- 
ing is controversial, since critics claim its pure free- 
market approach to labor has the potential to be highly 
exploitative, particularly of those in developing coun- 
tries; one recent analysis estimates that the average 
hourly wage on Mechanical Turk is $5/hour [8]. More- 
over, even on the employer side of the equation, crowd- 
sourcing can be problematic since—absent any strong 
reputation mechanism—there may be little incentive for 
workers to provide quality work-products. Consequently, 
third-party services, such as crowdflower, have emerged 
that trade cost for data quality by replicating work re- 
quests and voting among them [1]. 

However, a less appreciated negative impact of this 
ecosystem is how anonymous access to cheap aggregated 
labor affects the security of existing Internet services.
Indeed, as we show in this paper, the crowdsourced mar- 
ket for Web service abuse labor is thriving. 

Much of this activity takes place on “freelancing” 
sites, in which employers post jobs and select individ- 
ual workers based on their bids and bilateral negotiation. 
There are a large number of such sites with the most pop- 
ular being Freelancer, Elance, RentACoder, Guru, and 
oDesk. In this paper we specifically examine the activity
at Freelancer.com, one of the oldest and largest sites,
claiming roughly two million employers and workers [6] 
from 234 different geographic regions and with close to 
nine hundred thousand projects posted on the site since 
2004. We specifically chose Freelancer because the site 
offers an open API for querying information about past 
jobs and users. We have also gathered smaller amounts 
of data from most of the other large freelancer sites (i.e., 
via scraping) but since the activity is extremely similar 
across sites we chose to focus on the one for which our 
data was comprehensive. 

Visitors to Freelancer must register and select a handle 
by which they are visible to other users. The only due 
diligence concerning a user’s identity is a requirement 
to have a valid email address. The site does offer “skills 
tests” for a fee, by which individual users may demon- 
strate proficiency in various skills and earn “badges” vis- 
ible on their profile. There is no discrimination between 
employers and workers and any user can participate on 
either or both sides of the labor market. 

To post a project on the site, the project poster, or 
buyer, must pay a $5 fee, which is refunded once a 
worker is selected. Buyers may choose to pay an additional
fee of $14 to have their jobs “featured”, meaning
that they are listed towards the top of the job listings.
Workers independently scan these job listings to
find projects matching their particular skill sets and then 
place bids (a combination of structured fields, such as 
dollar amounts, and freeform text). Buyers then select 
the workers who are most appropriate for their tasks. 

Once workers are chosen, Freelancer charges buyers either $3
or 3% of the total project cost, whichever amount is higher
(Freelancer acts as the middleman
in the transaction, using online payment methods
such as PayPal, Moneybookers and Webmoney). How- 
ever, some less scrupulous buyers are reputed to simply 
cancel their orders and settle with workers out-of-band. 
Finally, while job postings are effectively “broadcast”, 
there are a range of such posts that identify themselves 
as private by specifically identifying the workers they are 
interested in employing. 
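
As a small worked example of the commission rule described above (the greater of $3 or 3% of the project cost), the following Python sketch computes the charge to the buyer. The function name is hypothetical, and details such as rounding or currency handling are not specified in the text and are ignored here.

```python
def buyer_commission(project_cost: float) -> float:
    """Commission charged to the buyer: the greater of $3 or 3% of the cost."""
    return max(3.00, 0.03 * project_cost)
```

Under this rule the flat $3 minimum applies to any project costing $100 or less, so for small jobs the commission is a much larger share of the total than 3%.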


3 Data Overview 


In this section, we describe our methodology for collect- 
ing data on Freelancer job activity and categorizing the 
jobs into various kinds of “dirty” tasks. 


3.1 Data Collection Methodology 


Freelancer.com exports an API for programmatically 
querying for information regarding projects and users. 
Using this API, we implemented a crawler to collect both 


² While space prevents a detailed description of the oversight and
ethical considerations here, our protocols were reviewed by our Human
Research Protections Program and we consulted with key brand holders
in advance of any active purchasing activity.


Activity                        Count
Projects                        842,199
Projects w/ Selected Workers    388,733 (46%)
Project Bids                    12,656,978
Active Users                    815,709
  Buyers Only                   179,908 (22.1%)
  Workers Only                  590,806 (72.4%)
  Buyers & Workers              44,995 (5.5%)


Table 1: Summary of Freelancer activity between February 5,
2004 and April 6, 2011.


contemporary and historical information about Free- 
lancer activity. We ran the crawler from December 16, 
2010 through April 6, 2011 to minimize load on the 
site. For historical data, we observed that Freelancer 
uses monotonically increasing IDs for both projects and 
users. To crawl all projects over time, we iterated through 
the entire available project ID space, which at the time 
ranged from 1 to 1,015,634. As a result, the job postings in
our data set represent all of the jobs that were viewable
through the API. We derived the set of user IDs based
upon the set of projects, including any user associated
with a project as buyer, bidder, or worker.³
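
The crawl described above can be sketched as follows. This is an illustration only: `fetch_project` is a hypothetical callable standing in for a call to the site's API (its real endpoints and response format are not given here), and the record layout with `buyer`, `bidders`, and `workers` keys is an assumption.

```python
from typing import Callable, Dict, Optional, Set, Tuple


def crawl_projects(
    fetch_project: Callable[[int], Optional[dict]],
    max_id: int,
) -> Tuple[Dict[int, dict], Set[int]]:
    """Walk the project ID space 1..max_id and derive the set of active users.

    fetch_project(project_id) is assumed to return a dict for a valid ID
    and None when the API reports the ID as invalid.
    """
    projects: Dict[int, dict] = {}
    user_ids: Set[int] = set()
    for project_id in range(1, max_id + 1):
        record = fetch_project(project_id)
        if record is None:
            continue  # invalid ID: never existed or was deleted
        projects[project_id] = record
        # Any user attached to a project counts as active.
        user_ids.add(record["buyer"])
        user_ids.update(record.get("bidders", []))
        user_ids.update(record.get("workers", []))
    return projects, user_ids
```

Deriving users from projects rather than from the user ID space keeps the user set limited to accounts that actually took part in at least one project.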

For all crawled projects, we extracted the project de- 
tails and the corresponding project bids, as well as the 
buyer, bidders, and selected workers who were awarded 
the projects (if any). For all users we encountered, we 
downloaded their public account metadata and feedback 
comments. 


3.2 Data Summary


Starting with the earliest project posted on February 5, 
2004 at 12:28 EST, we collected data through April 6th, 
2011, capturing over seven years of activity. Table 1 sum- 
marizes this data set. During this time, 842,199 jobs were 
posted to Freelancer⁴ and 815,709 users were active on
the site. Roughly 46% of the posted jobs report a worker 
selected for the job. This number represents a lower 
bound on the number of job transactions; a buyer and 
a worker will sometimes use Freelancer to rendezvous, 
but will negotiate the transaction through private mes- 
saging and, thus, never report a selected worker. Among 
all users associated with at least one project, 22.1% were 
buyers only, 72.4% were bidders/workers only, and 5.5% 
served as both. 


³ Unlike projects, we did not exhaustively collect information for all
two million users by crawling the user ID space since the majority of
users do not appear to be active on the site.

⁴ Note the discrepancy of 173,435 jobs between the maximum ID
and the number of postings we obtained through the API. When crawling
these IDs, Freelancer’s API returned an error indicating that the ID
was invalid. We assume that invalid IDs are jobs that never existed or
have been deleted, which, according to complaints, happens for only
a select number of jobs that egregiously violate Freelancer rules.

Figure 1: Growth in Freelancer activity over time. Numbers in
parentheses in the legend denote the largest activity in a month.


Activity on Freelancer has grown steadily over time. 
Figure 1 shows the number of jobs offered and the num- 
ber of bids made per month, as well as the number of 
new buyers and bidders per month. To overlay and compare
the curves, we normalized them to their maximum
monthly value as listed in the figure legend. The curves all show a
drop in activity in December 2006 (the reason for which 
we have not been able to determine). After this point, 
Freelancer experiences strong linear growth in buyers 
and bidders and their associated posting and bidding ac- 
tivity. Freelancer’s job market is healthy and growing. 
Work posted by a steadily increasing number of buyers 
(5,000 new buyers a month on average in 2010) has been 
satisfied by an equally steadily increasing supply of bid- 
ders (15,000 new bidders/month). 


3.3 Categorizing Jobs 


Our first step in understanding Freelancer activity is to 
categorize the types of jobs found on the site. We use a 
two-step process for this categorization. We first manu- 
ally browse sampled projects to identify a meaningful list 
of job categories. We then use a combination of keyword 
matching and supervised learning to identify jobs from 
the entire Freelancer corpus that fall into the categories. 

From browsing random job postings, gauging the in- 
terest level in various tasks from observed bidding ac- 
tivity, and incorporating awareness of the larger under- 
ground cybercrime ecosystem, we identified 22 types of 
jobs falling into six categories. To establish a baseline of 
the prevalence of these types of jobs, we manually in- 
spected a random sample of 2,000 jobs, tagging each job 
with a category. 

Table 2 summarizes the list of categories and the dis- 
tribution of jobs that fall into each category from our ran- 
dom sample. Note that a job may be tagged under multi- 
ple categories; for example, social bookmarking jobs for 
search engine optimization (SEO) usually also require 
account creation. Legitimate projects comprise 65.4% of 
these jobs and primarily involve Web-related program- 
ming and content creation tasks. We include private jobs, 
corresponding to projects targeted to one specific user, in 
the legitimate job class, since we typically do not know

Category / Job Type               Description                                        Count     %
Legitimate [§A.1]
  Web Design/Coding               Create, modify, or design a Web site                 769  38.5
  Multimedia Related              Complete multimedia-related task (e.g., Flash)       265  13.2
  Private Jobs                    Jobs designated for a particular worker              138   6.9
  Desktop/Mobile Applications     Create a desktop or mobile application               100   5.0
  Legitimate Miscellaneous        Miscellaneous jobs                                   177   8.8
Accounts [§A.2]
  Account Registrations           Create accounts with no defined requirements          22   1.1
  Human CAPTCHA Solving           Requests for human CAPTCHA solving                    19   0.9
  Verified Accounts               Create verified accounts (e.g., phone)                14   0.7
SEO [§A.3]
  SEO Content Generation          Requests for SEO content (e.g., articles, blogs)     195   9.8
  Link Building (Grey Hat)        Get backlinks using grey hat methods                  53   2.6
  Link Building (White Hat)       Get backlinks using no grey/black hat methods         20   1.0
  SEO Miscellaneous               Nonspecific SEO-related job postings                  61   3.0
Spamming [§A.4]
  Ad Posting                      Post content for human consumption                    25   1.2
  Bulk Mailing                    Send bulk emails                                       8   0.4
OSN Linking [§A.5]
  Create Social Networking Links  Get friends/subscribers/fans/followers/etc.           33   1.7
Misc [§A.6]
  Abuse Tools                     Tools used for abuse (e.g., CAPTCHA OCR)              41   2.1
  Clicks/CPA/Leads/Signups        Get clicks, emails, zip codes, signups, etc.          32   1.6
  Manual Data Extraction          Manually visit websites and scrape content            21   1.1
  Gather Email/Contact Lists      Research contact details for targeted people          17   0.9
  Academic Fraud                  Write essays, code homework assignments, etc.         10   0.5
  Reviews/Astroturfing            Create positive reviews                                1   0.1
  Other Malicious                 Miscellaneous jobs with malicious intentions          35   1.8


Table 2: Distribution of 2,000 random, manually-labeled projects into job categories. Referenced sections of the appendix include
examples of jobs in the corresponding category.


the job details; private postings, however, will sometimes 
contain enough data to determine their intent. In our 
manually labeled corpus, we were unable to determine 
the intent of 5.4% of the jobs. The remaining 29.2% of 
the jobs correspond to various kinds of “dirty” jobs, rang- 
ing from delivering phone-verified Craigslist accounts 
in bulk to a wide variety of search-engine optimization 
(SEO) tasks. 

We then focused on identifying jobs in the entire Free- 
lancer corpus that fall into “dirty” categories. Since we 
could not manually classify all jobs, we used keyword 
matching to generate training sets and supervised learn- 
ing to train classifiers for each category. We then applied 
the classifiers to each job to determine the dirty category 
it falls into, if any. 

To find positive examples for each classifier, we used 
keywords associated with the job type to conservatively 
identify jobs that fall into each category. For example, 
to locate jobs about CAPTCHA solving, we searched 
job postings for the terms “CAPTCHA” and “type” or 
“solve”. For negative examples, we randomly chose jobs 
from the other orthogonal job types. For features, we
computed the well-known tf-idf score (term frequency-inverse
document frequency) of each word present in the
title, description, and keywords associated with jobs in
the training sets. We then used svm-light [9] to train classifiers
specific to each category.

Table 3 shows the results of applying these classifiers 
to all Freelancer jobs. We focus on just those dirty job 
categories that had at least 1,000 jobs. Although the clas- 
sifiers are not perfect (e.g., some jobs placed in the “link 
building” categories might be better placed in the more 
generic “SEO” category), they sufficiently capture the set 
of jobs in each category and greatly increase the number 
of jobs we can confidently analyze. Note that we did not 
attempt to be complete in the categorization of the post- 
ings: there are likely jobs that should be in a category 
that we have missed. However, such jobs are also likely 
not well-marketed to workers, since they most likely lack 
the typical keywords and phrases commonly used in jobs 
under those categories. 

We focus on the jobs comprising these categories in 
the analyses we perform in the subsequent sections. 
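
To illustrate the keyword matching and tf-idf features used in this section, the following is a minimal Python sketch. It is not the pipeline used in the study: the tokenizer, the exact tf-idf variant (several exist), and all function names are assumptions, and the classifier training step with svm-light is not reproduced.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def is_captcha_job(text: str) -> bool:
    """Conservative keyword rule: "CAPTCHA" together with "type" or "solve"."""
    words = set(tokenize(text))
    return "captcha" in words and ("type" in words or "solve" in words)


def tfidf_vectors(documents: List[str]) -> List[Dict[str, float]]:
    """Compute one sparse tf-idf vector (word -> score) per document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    tokenized = [tokenize(d) for d in documents]
    n_docs = len(tokenized)
    doc_freq: Counter = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each word once per document
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return vectors
```

Jobs selected by a rule such as `is_captcha_job` would serve as positive training examples, and the sparse vectors from `tfidf_vectors` would be the features handed to a per-category classifier.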


3.4 Posting Job Listings 


Pricing information is a crucial aspect of our study, since 
it represents the economic value of an abusive activity to 
attackers. Both job descriptions and bids contain pricing 
info, often at odds with each other. To determine which 
source of pricing info to use, we performed an experi- 
ment where we posted jobs on Freelancer and solicited 
bids. In the process, bidders posted public bids and, in 
some cases, sent private messages to our user account.
These private messages occasionally reveal the external
Web store fronts operated by Freelancer workers, in addition
to the tools, services, and methods they use to complete
each type of job. We posted 15 job listings representative
of the categories for which we have classifiers.
We also randomly posted half of the jobs as a “featured”
listing to determine whether this increased the quantity
of bids we received (which it did).

Class         Job Type                     Count      %
Accounts      Account Registrations        6,249    0.7
              Human CAPTCHA Solving        4,959    0.6
              Verified Accounts            3,120    0.4
SEO           SEO Content Generation      72,912    8.7
              Link Building (Grey Hat)    16,403    1.9
              Link Building (White Hat)   10,935    1.3
Spamming      Ad Posting                  11,190    1.3
              Bulk Mailing                 3,062    0.4
OSN Linking   OSN Linking                 11,068    1.3

Table 3: Freelancer jobs categorized using the classifiers.

Class             Job Type                Bids      Cost
Accounts [§B.1]   Craigslist PVA          10 (4)    $4.25
                  Gmail Accounts          6 (5)     $0.07
                  Hotmail Accounts*       21 (12)   $0.007
                  Facebook Accounts*      24 (10)   $0.07
SEO [§B.2]        Blog Backlinks*         10 (5)    $0.30
                  Linking (White Hat)*    17 (8)    $0.81
                  Forum Backlinks         12 (9)    $0.50
                  Social Bookmarks*       44 (21)   $0.13
                  Bulk Article Writing    29 (23)   $3.00
Spamming [§B.3]   Bulk Mailing            10 (5)    0.075¢
                  Craigslist Posting      10 (3)    $0.60
OSN [§B.4]        Facebook Friends*       11 (4)    $0.026
                  [row illegible in source]
                  MySpace Friends         2 (2)     $0.037
                  Twitter Followers*      7 (6)     $0.02

Table 4 summarizes the results of our job posting ex- 
periments. Of the 228 total bids we received, 47 were 
commensurate with market rates for these projects. Most 
of the remaining bids, however, were simply minimum 
bids used as “place holders”. The actual bid amount was 
either included in a private message to our buyer account, 
or the bidder provided an email address to negotiate a 
price outside of the Freelancer site to avoid the Free- 
lancer fee. 

Because many prices in the public bids severely un- 
derestimate market prices, we use the prices in job de- 
scriptions by buyers in our studies in Section 4. Even 
so, we note that the pricing data has some inherent bi- 
ases. They are advertised prices and not necessarily the 
final prices that may have been negotiated with selected 
workers. Further, we use prices that were systematically 
extracted from the job descriptions. Even with hundreds 
of hand-crafted regular expressions, we were only able 
to extract pricing data from about 10% of the jobs. Job 
descriptions are notoriously unstructured, ungrammati- 
cal, and unconventional. These biases notwithstanding, 
the pricing data is still useful for comparing the relative 
value of jobs, as well as trends over time. 
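
As an illustration of the kind of pattern involved in extracting prices from job descriptions, the following Python sketch handles a single common phrasing ("$X per N units"). The regular expression and function name are hypothetical and are not taken from the study, which relied on hundreds of hand-crafted expressions.

```python
import re
from typing import Optional, Tuple

# Matches phrasings such as "$10 per 1,000 accounts" or "$1/1000 captchas".
PRICE_PER_UNIT = re.compile(
    r"\$\s*(?P<amount>\d+(?:\.\d+)?)"   # dollar amount
    r"\s*(?:per|for|/)\s*"              # separator
    r"(?P<count>\d[\d,]*)?\s*"          # optional quantity, e.g. 1,000
    r"(?P<unit>[a-z]+)",                # unit of service, e.g. accounts
    re.IGNORECASE,
)


def extract_unit_price(description: str) -> Optional[Tuple[float, str]]:
    """Return (price per single unit, unit name), or None if no price is found."""
    match = PRICE_PER_UNIT.search(description)
    if match is None:
        return None
    amount = float(match.group("amount"))
    count_text = match.group("count")
    count = int(count_text.replace(",", "")) if count_text else 1
    return amount / count, match.group("unit").lower()
```

Normalizing every price to the smallest unit of service makes offers such as "$10 per 1,000 accounts" and "$0.01 per account" directly comparable.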


4 Case Studies 


This section features case studies of the four groups of 
abuse-related Freelancer jobs summarized in Table 3.


4.1 Accounts 


Accounts on Web services are the basic building blocks 
of an abuse workflow. Because they are the main mech- 
anism for access control and policy enforcement (e.g., 
limits on number of messages per day), circumventing 
these limits requires creating additional accounts, often

Table 4: Results from posting job listings to Freelancer. A “*”
indicates the post was featured; the number in parentheses is
the number of bids that included prices. All prices in the cost
column are for the smallest unit of service (i.e., per one account,
backlink, email, post, or 500-word article).


at scale. Thus account creation has become the primary 
battlefield in abuse prevention. 

Accounts primarily enable a wide variety of spamming 
and scamming. For Web mail services like Gmail and Ya- 
hoo, spammers use accounts to send email spam, taking 
advantage of the reputation of the online service to im- 
prove their conversion rate. For online social networks 
like Facebook and Twitter, spammers use accounts to 
spam friends and followers (Section 4.2), taking advan- 
tage of relationships to improve conversion. For classi- 
fied services like Craigslist, spammers use accounts to 
create highly-targeted lists, post high-ranking advertise- 
ments for a variety of scams, recruit money laundering 
and package handling mules, advertise stolen goods, etc. 
Further, accounts on some services easily enable paired 
accounts on related services (e.g., creating a YouTube ac- 
count from a Gmail account), further extending the op- 
portunities for spamming. 


4.1.1 Account Creation Insights 


In the context of another research effort, we obtained 
approval from a major Web mail provider to purchase 
fraudulently-created accounts on their service. We pur- 
chased 500 such accounts from a retail site selling ac- 
counts, gave them to the provider, and in return received 
registration metadata for the supplied email accounts, in- 
cluding account creation times and the IP addresses used 
to register the accounts. We later discovered that the supplier
we contacted was a very active member of Freelancer.com;
this worker is responsible for account set IN1
in Table 5.

Name     Rating   Tested   Valid (%)   Age (Days)
IN1*       9.8       500      100.0         0.4
UK1        9.9     3,500       99.9        25.7
BD1       10       6,999       99.6        24.7
IN2        9.8     5,015       99.6         9.7
PK1       10       4,999       99.4        78.6
PK2        9.8     4,000       95.4        82.6
PK3        9.9     4,013       77.3       414.7
IN3        9.9     6,200       76.2        30.7
CA1**      9.6       508       15.7        21.7


Table 5: Summary of the results from purchasing email accounts.
The names of the account sets embed the worker
countries: IN is India, UK is the United Kingdom, BD is
Bangladesh, PK is Pakistan, and CA is Canada. The rating column
refers to the average rating of the selected worker. Notes:
*We purchased IN1 in 2010, the rest in 2011. **The worker
responsible for CA1 repeatedly copied and pasted 508 accounts
to meet the 5k requirement.


The supplier had bid on 2,114 projects, had been chosen
as a selected worker on 147 projects, and served as
a buyer on 84 projects. Interestingly, the supplier acted 
as a buyer for 25 jobs that involved the creation of other 
Web mail account types. The supplier contracted out this 
task at a rate of $10-20 per 1,000 accounts, and yet the 
supplier charged $20 per 100 accounts on the retail Web 
site, an order of magnitude more. 

The accounts we purchased were created an average 
of only 2.8 seconds apart, suggesting the use of either 
automated software or multiple human account creation 
teams in parallel.⁵ Such automation would be one way
to earn money bidding on account jobs for this particular 
worker. Further, 81% of the IP addresses used to register 
the accounts were on the Spamhaus blacklist, suggesting 
the use of IP addresses from compromised hosts to defeat 
IP-based rate limiting of account creation. 
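
The two statistics in this paragraph can be computed from registration metadata along the following lines. This Python sketch assumes the metadata is available as a list of creation timestamps and a list of registering IP addresses, and that the blacklist is a simple set of addresses; how the original figures were actually computed is not described, and the function names are hypothetical.

```python
from datetime import datetime
from typing import Iterable, List, Set


def mean_gap_seconds(creation_times: Iterable[datetime]) -> float:
    """Average time in seconds between consecutive account creations."""
    ordered: List[datetime] = sorted(creation_times)
    if len(ordered) < 2:
        raise ValueError("need at least two creation times")
    gaps = [(later - earlier).total_seconds()
            for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)


def fraction_blacklisted(ip_addresses: List[str], blacklist: Set[str]) -> float:
    """Fraction of registration IP addresses that appear on the blacklist."""
    if not ip_addresses:
        raise ValueError("no IP addresses given")
    return sum(ip in blacklist for ip in ip_addresses) / len(ip_addresses)
```

A very small mean gap across hundreds of accounts points to automation or to parallel creation, while a high blacklisted fraction points to registrations routed through hosts already known for abuse.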


4.1.2 Experience Purchasing Accounts 


In 2011, we commissioned a job to purchase additional 
email accounts for the same Web mail provider in quantities
ranging from 3,500 to 7,000. We selected nine different
workers, of which eight ultimately produced accounts,
listed in Table 5 after IN1. Once given the accounts
and the corresponding passwords, we logged into
the accounts and downloaded the newest and oldest in- 
box pages (assuming the account was valid). Table 5 
shows the results of the purchasing and account check- 
ing. Of the eight email sets, seven consisted of largely 
valid accounts, with over 75% of the tested email ac- 
counts yielding a successful login. IN3 was particularly 
interesting; the worker previously used the email ad- 
dresses to create Facebook and Craigslist logins and 
posts, then resold the accounts to us. Also, four of the 


⁵ We know that an effective automated CAPTCHA solver existed at
this time for this Web mail provider, so automation is the likely suspect.

Figure 2: Median monthly prices offered by buyers for
1,000 CAPTCHA solves (top) and the monthly volume of
CAPTCHA solving posts (bottom), both as functions of time.
The solid vertical price bars show the 25% to 75% price quartiles.

account batches are relatively old (as determined by the
date of their oldest emails), with the median age of the
accounts between two months and over one year. These
ages indicate that workers are likely sitting on a stockpile
of email accounts. Lastly, the worker ratings do not
seem to reflect the quality of the accounts, as demonstrated
by the high ratings (out of 10) achieved by those
workers responsible for the PK3, IN3, and most notably,
CA1 account sets.


4.1.3 CAPTCHA Solving 


To keep the barrier to participation extremely low, cre- 
ating an account at an online service today requires lit- 
tle more than solving a CAPTCHA. CAPTCHAs are de- 
signed to be hard to solve algorithmically, and thus create 
an obstacle to automating service abuse. In response to 
their widespread deployment, human-based CAPTCHA-solving
services emerged in the abuse ecosystem. Such services
depend on cheap human labor to provide a simple
programmatic interface for solving CAPTCHAs to an
otherwise completely automated abuse process chain.
In a previous study [12], we described a robust retail 
CAPTCHA-solving industry capable of solving a million 
CAPTCHAs a day at $1 per 1,000 solved. Thus today, 
CAPTCHAs are neither more nor less than a small eco- 
nomic impediment to the abuser, forming the first step in 
the account value chain. 

By their nature, CAPTCHAs are ideally suited to the 
Freelancer outsourcing paradigm, and indeed the Freelancer
marketplace has played a key role in the evolution
of CAPTCHA solving. Figure 2 shows the history of
prices offered for CAPTCHA solving as well as the demand
(in number of job offers per month) since 2007. We see
a rise in demand starting from their first appearance, and
a corresponding drop in prices to the $1 per 1,000 price
seen today, corroborating our previous findings [12].


[Figure 3 graphic: paired bars per service for “Basic accounts” and “Verified accounts”; services shown: Craigslist, PayPal, Facebook, eBay, Twitter, MySpace, Hotmail, YouTube, Yahoo!, Gmail.]

Figure 3: Sites targeted in account registration jobs.


4.1.4 Account Verification 


Because creating a basic account (even one requiring
solving a CAPTCHA) is so cheap, services that want to curb
online abuse must necessarily take advantage of some limited
resource available to a user. To increase the limits placed
on a basic account, a user must sometimes undergo account
verification, which takes a variety of forms (e.g.,
phone numbers or credit cards). Verification increases
the user’s standing within the service, giving the account 
holder greater access to the service and thereby increas- 
ing the value of the account. For this reason, verification 
is a step in the value chain of many abuse processes. 

The most popular type of verified account uses phone 
verification. Beyond the steps for creating a basic ac- 
count, phone-verified accounts (PVAs) require a work- 
ing phone number as an additional validation factor in 
account authorization. Services will either call or message
a code to the number, and the user must submit
the code back to the service to complete authorization.
For some services phone verification is mandatory
(e.g., for posting advertisements in certain forums on 
Craigslist, creating multiple accounts in Gmail from the 
same IP address), and for other services, phone verifi- 
cation adds convenience (e.g., avoids CAPTCHAs with 
Facebook). Services typically require the phone number 
to be associated with a landline or mobile phone since, 
unlike VoIP phone numbers, it is much more difficult 
to scale the abuse of such numbers. Phone verification 
is effective: immediately after Gmail introduced phone 
verification to limit account abuse, for instance, prices 
for Gmail accounts on underground forums skyrocketed 
to 10 times those of other Web-mail accounts [2]. However, even
more so than CAPTCHAs, phone verification adds delay and
inconvenience for users, which is the primary reason why
services do not use it uniformly.


4.1.5 Web Services Targeted 


Figure 3 shows the distribution of services targeted in job
postings for basic and verified account registrations. For
ease of comparison, it shows the top 10 targeted services
for both kinds of accounts, combined. For a job targeting
multiple services, we count it in the total for each service
mentioned. Job postings target accounts in every major
category of Internet service: Web mail, social networks,
as well as financial and marketplace services. However, the
distribution of specific services differs markedly between
the two types of account registration jobs, reflecting how
services vary in their deployment of additional verification
mechanisms (if any). Basic accounts are useful
for many purposes, including signing up for accounts at
other Internet services (Facebook, Craigslist, etc.), and
Gmail is by far the most popular. When it comes to verified
accounts, on the other hand, Craigslist is the dominant
target, most certainly because Craigslist sections
targeted by spammers all require PVAs.

[Figure caption fragment: Craigslist introduced phone verification for erotic services ads
(March 2008) and other services (May 2008).]
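
The counting rule used for Figure 3 (a job that targets several services is counted once for each service it mentions) can be sketched as follows. This is only an illustration: the keyword lists and sample postings are hypothetical, and the text does not say how service names were actually matched.

```python
from collections import Counter

# Hypothetical keyword lists; the matching rules used in the study are not given.
SERVICE_KEYWORDS = {
    "Gmail": ["gmail"],
    "Craigslist": ["craigslist", "craigs list"],
    "Facebook": ["facebook"],
}

def services_mentioned(text):
    """Return the set of services whose keywords appear in a job posting."""
    lowered = text.lower()
    return {name for name, words in SERVICE_KEYWORDS.items()
            if any(w in lowered for w in words)}

def count_targets(postings):
    """Count each posting once for every distinct service it mentions."""
    counts = Counter()
    for text in postings:
        counts.update(services_mentioned(text))
    return counts

# Made-up postings for illustration only.
postings = [
    "Need 500 Gmail accounts",
    "Craigslist PVA and Facebook accounts wanted",
    "Looking for craigs list phone verified accounts",
]
counts = count_targets(postings)
```

Because a posting can contribute to several services, the per-service totals can add up to more than the number of postings (here, four counts from three postings).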

We posted a job soliciting bids for “CraigsList Phone 
Verified Accounts PVA” on Freelancer.com. Of the 10 
bids we received, 4 contained prices: $3, $4, $4.50, and 
$6. These prices are consistent with the currently ob- 
served buyer offers for Craigslist PVAs. The pricing of 
PVAs tells us in monetary terms the value of phone ver- 
ification as a security mechanism. For Craigslist, PVAs 
have made account abuse extremely expensive. In con- 
trast, retail services sell Gmail PVAs for around 25¢, a 
10-20 fold price difference compared to Craigslist. 
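
As a rough check on the size of this gap, the four priced bids can be compared directly against the approximate Gmail PVA retail price. The snippet below uses only the figures quoted above; the variable names are ours.

```python
from statistics import median

craigslist_bids = [3.00, 4.00, 4.50, 6.00]  # dollars, the four priced bids we received
gmail_pva_price = 0.25                       # dollars, approximate retail price

# Ratio of each Craigslist PVA bid to the Gmail PVA price.
ratios = [bid / gmail_pva_price for bid in craigslist_bids]
low, high = min(ratios), max(ratios)
median_bid = median(craigslist_bids)
```

The bids alone give ratios between `low` and `high`, which is on the same order as the 10-20 fold difference stated in the text.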


4.1.6 Trends 


Demand for accounts through Freelancer grew dramati- 
cally starting mid-2008. Figure 4 shows the number of 
account creation jobs posted over time. Demand for ba- 
sic accounts steadily increased through mid-2008, then 
dramatically increased until it peaked in mid-2009. 
Demand for verified accounts rose greatly when 
Craigslist introduced phone verification for the erotic ser- 
vices section of their site in early March 2008 [4]. De- 
mand grew steadily until about October 2009, and then 
dropped. We extracted prices from the Craigslist post- 
ings, and observed that Craigslist PVAs first rose to $4 
by the end of 2008 and then settled around $2. In October
of 2009, prices spiked to more than $5, then hovered
between $2 and $3 through 2010. 

For both types of accounts—basic and verified— 
demand dropped during 2010. We do not know the cause;
however, we suspect this may be due to stricter policing
on the part of Freelancer.com; our own price solicitation
for Craigslist posting was canceled by the site. 


4.2 OSN Linking 


Online social networking links can be abused in two 
ways: (1) as a communication channel to market to 
real users, which is a finished product ready to directly 
monetize; (2) as an intermediate product to increase the 
reputation—and thus influence—of accounts by adding 
social links to other fake accounts. Previous work has 
shown that online social networking spam has a higher 
click-through rate than traditional email-based spam [7]. 
Thus, OSN platforms have emerged as a lucrative mar- 
keting venue where spammers are exploiting the trust re- 
lationships that exist in social networks to improve their 
conversion rates. However, it is difficult for a spammer 
to contact users on a social networking site until they 
have established a social link with real users. These so- 
cial links take many different forms, depending on the 
targeted social networking site, such as convincing a user 
to friend the spammer, follow a spammer’s Twitter feed, 
become a fan of the spammer’s page, or subscribe to 
the spammer’s YouTube channel. Building social links 
to real users is analogous to gathering email addresses 
that will later be monetized with email spamming. Once
this social link is established, the spammer has a com- 
munication channel that is both highly reliable and not 
subject to aggressive filtering. 

Adding fake social links is a relatively inexpensive 
method for increasing the reputation of an account, 
which in turn presumably improves the success rate of 
establishing links to real users. This method is effective 
because people are more willing to establish or accept
social links with accounts that appear popular in terms of the
number of previously established social links or other endorsements.
If the account has many social links and, more
importantly, if mutual social links exist, the likelihood 
increases that the targeted real user will establish or ac- 
cept a social link with the spammer. 

In this section we survey the Freelancer.com market 
for buying both real and fake bulk social links. 


4.2.1 Characterization 


There are two main categories of social networking links 
requested in jobs. The first are friendship relationships 
(e.g., MySpace and Facebook friends), where an active 
invitation is offered and, if accepted, targeted messages 
can then be delivered to a user’s private inbox. The sec- 
ond are subscription relationships (e.g., Facebook fans, 
Twitter followers, YouTube subscribers) where, if a user
can be induced to follow a spammer’s account, messages
will appear in a user’s feed; depending on the site, the relationship
also grants the ability to send private messages
to the user. A closely related goal is to use social links
to increase the perceived popularity of an object. Examples
of this type of task are increasing the view count of
YouTube videos, or digging links on Digg. We group all
these jobs into the category of social network links, since
they all aim either to increase the reputation of
an account or object or to establish a marketing channel to
real users.

Figure 5: Number of job postings for social networking links.

Jobs for bulk social link building range from a few 
hundred to hundreds of thousands of links. Typically 
jobs interested in acquiring fake social links will re- 
quest a relatively small number of links spread out over a 
large number of accounts (e.g., add 500 friends to 50 accounts).
The requests for social links to real users often
specify a target demographic for the links, thereby exploiting
the same profile information that legitimate advertisers
on these sites use to improve ad targeting. For
example, a job might require that most social links be to
male accounts in the US over the age of 18. The most targeted
geographic demographics are high-income English-speaking
countries including the US (46%), UK (13.2%),
Canada (9.5%) and Australia (6.2%). Also, based on key- 
word searches, females are specifically targeted in 8% of 
jobs and males in 3% of the jobs. 


4.2.2 Trends 


Figure 5 shows the demand over time for job postings for 
social networking links. Overall demand for social links 
has skyrocketed since the early part of 2010, suggesting 
that spammers have only recently realized the potential 
for monetizing social links. The social networking sites 
with the largest English-speaking user bases (Facebook, 
MySpace, Twitter, and YouTube) are targeted by 97% of 
the job postings for social links. Over 50% of social link 
jobs included words such as “real” and “active” indicat- 
ing that they were seeking to buy a more finished type of 
social link that could be directly spammed. This percent- 
age is a lower bound, however, as it is unclear how many 
postings did not include these types of words but were
actually seeking real social links.

Name   Rating   Links    US     IN     BD     PH
BD2    9.8      1,034    26.2   13.8   5.9    7.7
BD3    9.8      1,081    43.3   7.4    32.5   4.4
BD4    8.4      1,063    74.5   0.3    25.2   -
BD5    10       1,071    -      -      100    -
BD6    10       1,145    60.0   8.7    8.4    5.3
BD7*   9.8      555      30.6   10.4   10.6   8.4
IN1    9.9      1,095    64.3   25.1   10.5   -
MY1    9.8      1,110    99.1   -      -      0.1
PK4    -        1,015    24.7   9.2    5.9    7.0
RO1    10       1,058    31.8   11.0   8.8    8.4

Table 6: Summary of the social links purchased to pages for our
custom Web sites. The names of the sets correspond to the selected
workers’ home countries, while the rating column refers
to his or her average rating. The worker responsible for BD7 did
not complete the job in a timely manner. Country codes: BD =
Bangladesh, IN = India, RO = Romania, MY = Malaysia, PK =
Pakistan.

Overall, the median offered price in posts was $0.01
per social link, and median bids were between $0.02 and
$0.03 per social link. These prices were similar across
all of the social networking sites. This low price point 
raises the interesting question of whether proposed de- 
fenses that mitigate Sybil attacks via analysis of social 
link structure [14, 15] might be vulnerable to adversaries 
that are willing to simply hire humans to create real so- 
cial links. 


4.2.3 Experiences Purchasing Social Links 


In preparation for purchasing social links, we instanti- 
ated several Web sites on the topic of cosmetics consult- 
ing [16] and created separate “pages” about each site on 
a popular social networking service. We then commis- 
sioned a job to obtain one thousand social links for these 
pages. The posted job explicitly targeted users from the 
US, Canada, and the UK. We assigned the task to 10 dif- 
ferent workers, each given a different Web site to target. 

Table 6 shows the results of this task. The names of
the sets correspond to the selected workers’ home countries,
and the links column is the maximum reported daily
number of social links. Most of the workers delivered 
the required number of social links in a timely manner 
(except for the BD7 set); the quality of the social links, 
however, was quite poor. Most of the workers did not de- 
liver social links from users that met our specifications, 
particularly with regard to user countries. Also, several of
the workers added social links at a rapid pace, with some 
jobs being completed in as few as two days. Next, we ob- 
served substantial overlap between the users linked to our 
target pages, shown in Figure 6. As many as 50% of the
users (between IN1 and BD2, for example) overlapped.
Overall, this suggests that the workers are all manipulating
the same set of users to produce these social links,
or perhaps even subcontracting the task to the same
groups of workers. Only one worker, responsible for MY1,
had no overlap with any of the other sites. Again, the
selected worker ratings do not reflect the quality of the
delivered products; we posit that buyers who hire these
workers find it difficult to evaluate social link quality.

Figure 6: The number of user accounts common to each pair of
workers hired to create social links. Labeled solid lines indicate
at least 100 user accounts (out of 1,000 requested) in common,
dashed lines indicate at least 10 but fewer than 100 user accounts
in common. Work performed by MY1, PK4, and BD2 was
done in April, while the remaining jobs were done roughly a
month later.

Figure 7: Median number of friends vs. median number of page
social links for the sets of users linked to our websites.
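
The pairwise overlap summarized in Figure 6 amounts to intersecting the sets of user accounts delivered by each pair of workers. The sketch below shows one way to compute it; the worker names and user IDs are toy data, and only the two thresholds (at least 100, and at least 10 but fewer than 100) come from the figure caption.

```python
from itertools import combinations

def pairwise_overlap(user_sets):
    """Map each pair of worker names to the number of shared user IDs."""
    return {
        (a, b): len(user_sets[a] & user_sets[b])
        for a, b in combinations(sorted(user_sets), 2)
    }

def edge_style(shared):
    """Classify an overlap count using the thresholds from the Figure 6 caption."""
    if shared >= 100:
        return "solid"
    if shared >= 10:
        return "dashed"
    return None  # too little overlap to draw an edge

# Toy data: hypothetical worker names and integer user IDs.
user_sets = {
    "A": set(range(0, 1000)),
    "B": set(range(500, 1500)),
    "C": set(range(1490, 2490)),
    "D": set(range(5000, 6000)),
}
overlaps = pairwise_overlap(user_sets)
```

In this toy example, workers A and B share 500 accounts (a solid edge), B and C share 10 (a dashed edge), and D shares none with anyone, which is the situation we observed for a single worker.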

Next, we extracted the profiles for the OSN users who 
were linked to our target Web sites, and looked at the 
number of friends and page links listed on their profiles. 
Figure 7 shows a scatterplot of the median number of 
friends versus the median number of page links for these 
OSN users. Several clusters emerge in the graph. Within 
each user batch, we manually visited the profiles of those 
users; only one worker, MY1, appears to have delivered
social links from legitimate users. The rest used predominantly
fake accounts, many of which had few friends and
a large number (>1,000) of page social links. 


4.3 Spamming 


In our study, we consider spamming to be the dissemina- 
tion of an advertiser’s message to users by means other 
than established advertising networks. Spamming pro- 
vides the buyer with a direct marketing channel to his 
targets, and as such, represents one of the most finished
commodities in the advertising value chain.⁶

In our survey and classifier-based labeling (Tables 2
and 3), the class of spamming jobs comprises ad
posting and bulk mailing.⁷ Because Craigslist is the main
target of ad posting jobs (82%), we treat it separately. We
begin by analyzing the pricing data for bulk mailing.


4.3.1 Bulk Mailing 


Bulk mailing is simply traditional email spam and rep- 
resents 0.3-0.4% of all jobs posted on Freelancer.com. 
In most cases, the buyers supply their own mailing lists, 
although some—generally targeting larger volumes— 
expect bidders to supply their own address lists. 

We extracted pricing data from the job descriptions 
of 236 postings. On average, buyers on Freelancer.com
were willing to pay approximately $5.62 to send 1,000
emails, while the median price was only $1.00. The extracted
prices varied wildly;
thus, we manually scanned another 100 random post- 
ings. Again, we observed a wide range of prices, from 
one buyer willing to pay only $0.06/1,000 emails, to an- 
other buyer willing to pay $5.00/1,000 emails. 
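
The gap between the mean and the median is what one would expect when a few high offers skew the average. A minimal sketch of normalizing offers to a price per 1,000 emails and comparing the two statistics is shown below. The regular expression and the sample postings are hypothetical; the actual extraction rules used in the study are not described here.

```python
import re
from statistics import mean, median

# Hypothetical pattern: matches phrases such as "$20 for 10,000 emails"
# or "$0.06/1,000 emails". Real postings are far less regular.
PRICE_RE = re.compile(
    r"\$(\d+(?:\.\d+)?)\s*(?:per|for|/)\s*(\d[\d,]*)\s*e-?mails?",
    re.IGNORECASE,
)

def price_per_thousand(text):
    """Return the offered price per 1,000 emails, or None if no price is found."""
    match = PRICE_RE.search(text)
    if match is None:
        return None
    dollars = float(match.group(1))
    emails = int(match.group(2).replace(",", ""))
    if emails == 0:
        return None
    return dollars * 1000 / emails

# Made-up postings for illustration only.
postings = [
    "Will pay $20 for 10,000 emails",
    "Budget: $1 per 1000 emails",
    "Offering $0.06/1,000 emails",
    "Need a sender, $50 for 5000 e-mails",
    "Send my newsletter, price negotiable",
]
prices = [p for p in map(price_per_thousand, postings) if p is not None]
average_price = mean(prices)
median_price = median(prices)
```

With these toy numbers a single $10 per 1,000 offer pulls the mean well above the median, the same pattern seen in the extracted prices.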

A final point of comparison is our own posting for bulk 
mailing services. We posted a job that involved sending 
bulk emails to three million individuals and received 10 
responses. Of the 10 responses, five included a price, and 
these prices ranged from $0.30 to $2 per 1,000 messages 
(with a median of $0.75/1,000 emails). 


4.3.2 Craigslist Ad Posting 


Posting an ad on Craigslist is typically free, but Craigslist 
takes special measures to restrict the number of ads 
posted by a single individual (e.g., IP rate limiting, 
CAPTCHAs, etc.). In the context of our study, when 
Freelancer.com buyers create jobs to “spam” Craigslist, 
their goal is to obtain repeated ad postings from workers, 
usually on a daily basis. This is done to keep a buyer’s 
ads at the top of the search results. Our classifier iden- 
tified 11,190 job postings of this type, 9,096 (81%) of 
which contained the service name “Craigslist” or a variation
thereof (in total comprising 1.1% of all jobs on
Freelancer.com).⁸

Figure 8 shows the prices offered by buyers for a single
Craigslist posting (top) and the average number of
job posts per day pertaining to Craigslist ad posting (bottom).
The solid circles indicate monthly median prices,
and the solid bars show the 25% to 75% quartiles of the
prices. In early March 2008, Craigslist added a phone
verification requirement for posting in the erotic services
section [4], and later extended the requirement to posting
in other parts of the site some time in early May
2008 (both dates indicated with dashed vertical lines in
the graph).

⁶The most finished commodity is actual site traffic; however, traffic
of reasonable quality (with respect to conversion rate) usually requires
site-specific targeting and additional advertiser-provided material
(“creatives”).

⁷While we found several other kinds of spam-like jobs (e.g., bulk
SMS), they did not represent a significant fraction of all jobs, and are
not part of our study.

⁸Classified ad sites BackPage and Kijiji represented 6.6% and 5.5%
of jobs classified as ad posting; we chose to focus on Craigslist because
it dominated this job category.

Figure 8: Median monthly prices offered by buyers for each
Craigslist ad posted (top), and the monthly number of posts
(bottom), both as a function of time. The solid vertical price
bars show 25% to 75% price quartiles. The dashed vertical lines
indicate approximate dates when Craigslist introduced phone
verification for erotic services ads (March 2008) and other services
(May 2008). The three bids received in response to our
solicitation are indicated with a triangle on the right edge.

Figure 8 illustrates that the demand for posting to 
Craigslist started growing gradually after the policy 
changes, and the prices offered by buyers stayed essentially
unchanged until mid-2009. Recall that in mid-2009,
demand for phone verified accounts (which are
dominated by Craigslist) appears to drop dramatically
(Figure 4), having increased rapidly over the previous year.
Note, however, that the demand for Craigslist ad posting
continues to rise during that same time period, with offered
prices nearly quadrupling within a year.
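
The monthly medians and quartile bars of the kind plotted in Figure 8 can be derived from a list of (month, offered price) pairs. The sketch below is one way to do this with the Python standard library. The sample offers are made up, and the choice of the "inclusive" quartile method is an assumption, since the text does not say which quartile definition was used.

```python
from collections import defaultdict
from statistics import median, quantiles

def monthly_price_summary(offers):
    """Group (month, price) pairs by month and summarize each month's prices."""
    by_month = defaultdict(list)
    for month, price in offers:
        by_month[month].append(price)
    summary = {}
    for month, prices in sorted(by_month.items()):
        if len(prices) >= 2:
            # Quartile method is an assumption; quantiles() needs >= 2 points.
            q1, _, q3 = quantiles(prices, n=4, method="inclusive")
        else:
            q1 = q3 = prices[0]
        summary[month] = {
            "q1": q1,
            "median": median(prices),
            "q3": q3,
            "jobs": len(prices),
        }
    return summary

# Made-up offers (dollars per ad) for illustration only.
offers = [("2009-06", p) for p in [0.25, 0.50, 0.50, 0.75, 1.00]]
offers.append(("2009-07", 2.00))
summary = monthly_price_summary(offers)
```

Reporting the median with quartiles, rather than the mean, keeps a handful of unusually high or low offers from dominating the monthly figure.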

To further compare pricing data, we posted a job de- 
scription on Freelancer soliciting bids for “Experienced 
Craigs List Posters.” We received 10 responses, with
three bids of $0.40, $0.60, and $0.65 per ad; these prices 
are shown for comparison purposes in the top graph of 
Figure 8 as solid triangles on the right edge. These prices 
are roughly in accordance with the buyer offers. 


4.4 Search Engine Optimization 

Search engine optimization (SEO) represents the second 
major advertising channel along with spam. SEO is a 
multi-billion dollar industry for improving the ranking 
of sites and pages returned in search results on popular
search engines. Improving the ranking of pages in
search results increases traffic to those pages. “White hat”
SEO improves the search rank of pages while obeying 
the guidelines provided by search engine companies like 
Google that prevent abuse of the indexing and ranking 
algorithms. “Black hat” SEO abuses the indexing and 
ranking algorithms, sacrificing the relevance of a page 
with the sole goal of attracting traffic via search results. 

There are three kinds of black hat SEO offerings 
on Freelancer.com, spanning the spectrum from least to 
most “finished”: content generation, link building, and 
search placement. 

Content generation increases the number of sites that 
contain indexable content together with links to a target 
page. This goal is achieved either by having writers gen- 
erate unique content for sites, often by rewriting existing 
material, or by using a semi-automated technique known 
as spinning. Spinning often uses structured templates to- 
gether with a variety of word, phrase, and sentence “dic- 
tionaries” to generate many variants of effectively the 
same content, and is analogous to the template-based 
techniques used to generate polymorphic spam that can 
defeat spam filters [11]. 

Link building is a more focused type of SEO job 
whose goal is to place links on pages with existing con- 
tent, emphasizing placement on pages with high rank as 
defined by search engines. Rather than generating and 
distributing content across many sites as a basis for im- 
proving the ranking of a target page, link building boot- 
straps on existing highly-ranked pages. 

The most finished kind of SEO job is search place- 
ment. The buyer does not care how the desired search 
placement is achieved, only that they place in the top 
search results on Google. Such jobs were relatively rare 
on Freelancer, and we only survey content generation and 
link building jobs in further detail. 


4.4.1 Content Generation 


A popular form of abusive SEO is to post “articles” 
to various sites and forums. These articles contain key- 
words and links intended to increase the search engine 
PageRank of a page in search results returned from 
queries that use the same keywords. With proper ac- 
counts (Section 4.1), the posting step can be automated. 
However, defenses implemented by search engines can 
detect automatically-generated article content. Such de- 
fenses have thereby created a demand for human workers 
to generate sufficiently realistic articles that defeat the 
countermeasures. Indeed, such article writing jobs rep- 
resent the most popular abusive job category by far, ac- 
counting for over 10% of all Freelancer jobs (Table 2). 
Article job descriptions request batches of 10-50 ar- 
ticles at a time in grammatically correct English on a 
particular topic, seek articles typically 250-500 words 
in length, and often have a variety of requirements 
that reflect perceived countermeasures implemented by
the search engines. A frequent requirement is sufficient
“originality” (albeit often of simply rewritten text) to
pass CopyScape, a popular plagiarism detection tool;
such originality counters the capability of search engines
to detect and discount similar content. Other such requirements
request rewritten text beyond straightforward
manipulation of existing content (simple synonym substitution,
transposing sentences, etc.).

Figure 9: Median monthly prices offered by buyers for each
article posted (top) and monthly average number of buyer posts
per day (bottom), as a function of time. Vertical price bars show
25% to 75% price quartiles.

Figure 9 shows buyer demand and offered prices over 
time for article content generation jobs. Growth in de- 
mand for articles has been strong, with the number of 
jobs offered increasing linearly, with a peak of nearly 
3,000 article jobs posted in August 2010. This substantial 
growth in demand strongly suggests that article writing is 
indeed an effective form of SEO abuse. Yet prices for ar- 
ticles have been relatively stable over the past four years, 
with buyers offering $2-4/article.


4.4.2 Experiences Purchasing Articles 


To evaluate the quality of the articles written by Freelancer
workers, we solicited and employed ten workers,
each to write six original articles on the topic of skin care
products. We required each article to contain at least 400
words, have a keyword density of at least 2%,⁹ and pass
the CopyScape [5] plagiarism detection system. Table 7
shows the results of this assignment. Workers are identified
by their two-letter country code and a digit. In addition to
the three criteria above, we also computed the articles’
Flesch-Kincaid Grade Level [10], a measure of text
readability based on word and sentence length, roughly
indicating the school grade level required to comprehend
the text. The FKGL column shows the score range of the
work produced by each worker after excluding their lowest
and highest scoring articles.

⁹“Keyword density” is the frequency of occurrence of a set of keywords
provided by the bidder to be included in the text. Keyword density
thresholds ensure that search engines index a Web page with respect
to the specified keywords. In our experiment, we provided workers
with keywords such as “dry skin moisturizer” and “exfoliating
scrub”.

                 Failed articles
ID     Rating    Len   KD    CS    FKGL
INs    9.50      -     6     -     8.8 ± 1.0
PH;    9.75      4     5     -     7.7 ± 0.9
BDg    -         -     4     -     8.1 ± 0.7
KW;    9.62      -     3     -     10.0 ± 0.3
INg    9.62      -     2     -     7.2 ± 0.8
UK2    10        -     1     2     9.0 ± 0.5
US,    10        -     1     -     8.6 ± 0.2
BDg    9.81      -     1     -     9.3 ± 0.5
AU;    -         -     1     -     11.0 ± 1.0
KE,    10        -     -     -     9.6 ± 1.0

Table 7: Quality of articles written by workers on the topic of
skin care products. Columns Len, KD, and CS show how many
of each worker’s six articles failed the length, keyword density,
and CopyScape plagiarism detection requirements. The FKGL
column shows the Flesch-Kincaid Grade Level [10] range of
each worker’s text after excluding their lowest and highest scoring
articles. Country codes: PH = Philippines, IN = India, BD
= Bangladesh, KW = Kuwait, UK = United Kingdom, US =
United States, AU = Australia, KE = Kenya.
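
The two mechanical checks described above, keyword density and Flesch-Kincaid Grade Level, can be approximated in a few lines. The sketch below is not the tooling used in our experiment: the keyword density formula (share of words that belong to a keyword phrase occurrence) and the vowel-group syllable counter are simplifying assumptions. The grade level formula itself is the standard one, 0.39 (words/sentences) + 11.8 (syllables/words) - 15.59.

```python
import re

def words(text):
    """Split text into lowercase word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())

def keyword_density(text, keywords):
    """Percentage of words that are part of an occurrence of any keyword phrase.

    This definition is an assumption; other definitions count phrases, not words.
    """
    tokens = words(text)
    if not tokens:
        return 0.0
    hits = 0
    for phrase in keywords:
        parts = words(phrase)
        n = len(parts)
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == parts:
                hits += n
    return 100.0 * hits / len(tokens)

def count_syllables(word):
    """Rough heuristic: count groups of consecutive vowels, at least one per word."""
    return max(1, len(re.findall(r"[aeiouy]+", word)))

def fkgl(text):
    """Flesch-Kincaid Grade Level using the heuristic syllable counter above."""
    tokens = words(text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not tokens or not sentences:
        return 0.0
    syllables = sum(count_syllables(w) for w in tokens)
    return 0.39 * len(tokens) / len(sentences) + 11.8 * syllables / len(tokens) - 15.59

sample = "Dry skin moisturizer helps dry skin. Use an exfoliating scrub weekly."
density = keyword_density(sample, ["dry skin moisturizer", "exfoliating scrub"])
grade = fkgl(sample)
```

A real article would be checked against the 400-word and 2% thresholds in the same way. Because the syllable heuristic is crude, scores from this sketch should be read as approximate.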

Quality of the work produced by the ten workers var- 
ied considerably. More than half of the articles produced 
by workers INs, PH;, and BDg did not meet our 2% key- 
word density requirement; in addition, PH, failed to pro- 
duce articles of the required length (400 words). On the 
other hand, half of the workers produced articles satisfy- 
ing our criteria in at least five out of six cases. Unfortu- 
nately, two of the articles produced by UK2 did not pass 
the CopyScape plagiarism detection tool, and as such, 
would likely not be indexed by search engines. 

Articles written by the workers were understandable 
and on topic. The Flesch-Kincaid Grade Level of the ar- 
ticles reveals a notable level of English composition. For 
comparison, five Wikipedia articles on the same topic 
had scores in the range 12.1 ± 0.5, while six articles
about skin care from Cosmopolitan (a popular women’s magazine in
the US) fell in the 7.9 ± 0.8 range. Thus,
at least with respect to SEO, our results show Freelancer 
to be a useful source of inexpensive content that would 
be difficult to distinguish mechanically from work pro- 
duced by more highly-paid specialist writers. 








4.4.3 Link Building 


Google reports a PageRank (PR) metric for every page, 
accessible via the Google Toolbar. The PR ranges from 
0-10, with new and least popular pages having a PR of 
0 and the highest ranked pages having a PR of 10. This
PageRank is a combination of the number of sites that
link to the page (so-called backlinks) and the PageRanks
of the pages with the backlinks. Not surprisingly,
another common SEO abuse is to increase the number of
sites that backlink to a page, and to have those backlinks
on sites with high PageRank.

Figure 10: Average price buyers offered for backlinks on pages
with a given PageRank (PR). Higher PRs correspond to more
popular (and valuable) pages. The number above the bar corresponds
to the number of jobs requesting backlinks of that PR.
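
To make the dependence on backlinks concrete, the following is a minimal sketch of the textbook PageRank power iteration. It is not Google's production algorithm, and it does not reproduce the 0-10 value shown in the Toolbar, whose mapping is not described here. The damping factor of 0.85, the iteration count, and the toy link graph are assumptions chosen for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank. `links` maps each page to the list of pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for t in pages:
                    new_rank[t] += damping * rank[page] / n
        rank = new_rank
    return rank

# Toy graph with hypothetical page names: three pages backlink to "target".
links = {
    "hub": ["target"],
    "blog1": ["target", "hub"],
    "blog2": ["target"],
    "target": [],
}
rank = pagerank(links)
```

In this toy graph the page with the most backlinks ends up with the highest score, and a backlink from "hub" (which itself has a backlink) contributes more than one from a page with none, which is why buyers pay more for links on highly ranked pages.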

Hiring people to perform this kind of SEO task is 
another frequent kind of abusive job on Freelancer, ac- 
counting for over 3% of all jobs. We placed such link- 
building tasks into two categories, “white hat” and “grey 
hat”. White hat link building jobs have requirements that 
specifically try to avoid search engine countermeasures, 
such as no link farms, no blacklisted sites, no redirects or 
JavaScript links, links on sites with generic top-level do- 
mains, and so on. Jobs also specify the PageRank of the 
pages on which the backlinks will be placed, and that the 
buyer will validate the links created according to all of 
their criteria. Grey hat link building is much more indis- 
criminate, such as spamming blogs with links embedded 
in comments. 

How much do people value backlinks as an SEO tech- 
nique? The job postings quantify this value in economic 
terms. For the “white hat” link building jobs for which 
we could automatically extract pricing data, Figure 10 
shows that the median price per backlink buyers offered 
is directly correlated with the PageRank (PR) of the page 
containing the backlink. One buyer offered over $25 
per backlink on pages with PR8, while buyers offered 
nearly $5 per backlink on PR4 pages, the most commonly
requested PR.

Next, we look more closely at buyers who posted
“grey hat” link building jobs, or ones that allow for such
questionable SEO methods as blog commenting, forum
posting, etc. For these Freelancer job postings, buyers often
directly specify the URL that they want promoted
using greyhat techniques. We extracted over two
thousand URLs that were present in the body of the greyhat
link building posts. Using Yahoo Site Explorer [13],
we checked the first 1,000 inlinks (restricted by the API)
pointing to each URL. Then, we filtered out URLs with more
than 1,000 inlinks (i.e., those with inlinks not retrievable via the
Yahoo API), yielding 813 sites. Table 8 shows the top
domain names for the inlinks. As expected, Blogspot and
Wordpress are highly targeted for link spamming. Yahoo
Answers and Groups, as well as Google Knol and Google
Sites, are also targeted.

Domain Name      Num. Sites   Num. Inlinks
Blogspot         316          10,028
Wordpress        213          2,402
Yahoo            147          1,187
ArticlesBase     143          747
Folkd            108          302
ArticleSnatch    107          491
Google           97           184
Squidoo          88           154
Diigo            88           277
ArticleAlley     88           471

Table 8: Summary of top 10 targeted domain names for greyhat
link purchasing.

Figure 11: Distributions of countries for buyers and bidders.
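
A tally of the kind shown in Table 8 can be produced by grouping inlink URLs by domain. The sketch below assumes that "Num. Sites" counts the distinct target sites receiving at least one inlink from a domain and that "Num. Inlinks" counts all such inlinks; this reading of the columns, the naive domain extraction, and the example URLs are all assumptions.

```python
from collections import defaultdict
from urllib.parse import urlparse

def domain_label(url):
    """Second-to-last hostname label, e.g. 'blogspot' for foo.blogspot.com.

    Naive on purpose: it mishandles suffixes such as .co.uk.
    """
    host = urlparse(url).hostname or ""
    labels = host.split(".")
    return labels[-2] if len(labels) >= 2 else host

def tally_inlinks(inlinks_by_site):
    """Return {domain: (number of target sites, number of inlinks)}."""
    sites = defaultdict(set)
    inlinks = defaultdict(int)
    for site, urls in inlinks_by_site.items():
        for url in urls:
            domain = domain_label(url)
            sites[domain].add(site)
            inlinks[domain] += 1
    return {d: (len(sites[d]), inlinks[d]) for d in inlinks}

# Made-up target sites and inlinks for illustration only.
inlinks_by_site = {
    "http://example-shop.test/": [
        "http://a.blogspot.com/p1",
        "http://b.blogspot.com/p2",
        "http://answers.yahoo.com/q",
    ],
    "http://other.test/": ["http://c.blogspot.com/x"],
}
tally = tally_inlinks(inlinks_by_site)
```

Sorting the resulting dictionary by either count gives a ranking of the most heavily targeted domains.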


5 User Analysis 


We end our investigation of Freelancer activity by sur- 
veying the geographic demographics and job specializa- 
tion of Freelancer users. 


5.1 Country of Origin 


There are clear demographic differences between buyers 
and bidders. Figure 11 shows the distribution of coun- 
tries of origin for all buyers and bidders of the abuse- 
related jobs categorized in Table 2. (The distribution for 
selected workers closely follows the overall bidder distri- 
bution.) We extract the country of origin for users from 
their profile information. We note that this information 
is self-reported and nothing prevents users from being 
dishonest; further, we have seen instances where buy- 
ers post jobs specifically avoiding bidders from India, 
for instance, providing a potential motive for dishonesty. 
Numbers for such countries are therefore a lower bound. 

Figure 12: Top five countries of buyers posting abusive jobs.


The largest group of buyers is from the United States, 
and other English-speaking countries feature promi- 
nently (UK, Canada, Australia, even India). In contrast, 
the largest group of bidders is from India, followed by 
neighboring Pakistan and Bangladesh—countries with 
a large cheap labor force, substantial Internet penetra- 
tion, and where English is an official language or has 
widespread fluency. 

The country of origin demographics for each category
reveal yet more detail. Figures 12 and 13 show
the top five countries of buyers and bidders, respectively, 
for each abusive job category in Table 2. Buyers for ad- 
vertisement posting (generally targeting Craigslist, Sec- 
tion 4.3.2) are primarily from the United States, whereas, 
somewhat surprisingly, buyers for human CAPTCHA 
solvers are primarily from Bangladesh and India—these 
are buyers looking to form teams of solvers. Bidders 
from India and Bangladesh dominate white hat and so- 
cial networking link building jobs, respectively. Bidders 
from the only Western country (US) in the top five tar- 
get article generation, creating PVAs, and advertisement 
posting. 


5.2 Specialization 


Aside from a few uniform basic requirements, such as
understanding English and having access
to and basic knowledge of the Internet, the abuse jobs
posted on Freelancer essentially require unskilled labor.
As a result, Freelancers need not necessarily specialize
(focus solely on a particular job category) in the tasks
that they undertake.

As one metric of whether specialization occurs or not, 
we examined whether buyers and bidders participated in 
more than one category of job (for those buyers and bid- 
ders who engaged in more than one job). Indeed, bidders 
clearly do not specialize. For all but one category, on av- 
erage fewer than 5% of the jobs that bidders bid on are 
within the same category; the exception is article content 

Figure 13: Top five countries of bidders on abusive jobs.


generation, where nearly 15% of bids per bidder are on 
other article jobs. Moreover, not only are most bids on 
other job categories, but the majority of bids are on jobs 
that did not even fall into an abuse category in Table 2. 
In other words, for bidders who bid on at least one abuse 
job, 70-80% of their other bids were for a non-abuse job. 
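
A same-category share of the kind described above could be computed roughly as in the following Python sketch. The input layout (one `(bidder_id, category)` pair per bid), the function name, and the exact definition used here (for each category a bidder bid in, the fraction of that bidder's other bids that fall in the same category, averaged over bidders) are assumptions made for illustration; the text does not spell out the precise computation.

```python
from collections import defaultdict

def same_category_share(bids):
    """bids: iterable of (bidder_id, category) pairs, one per bid.

    For every bidder with more than one bid, and every category that
    bidder bid in, compute the fraction of the bidder's *other* bids
    that are in that same category. Returns the mean of that fraction
    for each category. (Hypothetical definition, see lead-in.)
    """
    per_bidder = defaultdict(lambda: defaultdict(int))
    for bidder, category in bids:
        per_bidder[bidder][category] += 1

    shares = defaultdict(list)
    for counts in per_bidder.values():
        total = sum(counts.values())
        if total < 2:
            continue  # only bidders with more than one bid are considered
        for category, n in counts.items():
            # One bid in this category is the reference bid; the rest are "other" bids.
            shares[category].append((n - 1) / (total - 1))
    return {c: sum(v) / len(v) for c, v in shares.items()}
```

A low value for a category under this definition would indicate that bidders active in it spread most of their remaining bids across other kinds of jobs.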

Buyers follow a pattern similar to that of bidders, but are
slightly more focused: 10% of a buyer’s jobs, on average,
are in the same category, while 60-70%
of a buyer’s jobs were for a non-abuse job. Article content
generation again is the one exception, with 30% of a
buyer’s jobs requesting articles.


6 Discussion 


Figure 14 illustrates how the various markets described 
in this study fit together in the Web abuse chain. At the 
lowest level, workers need access to Web proxies (due 
to account registration limits placed on IP addresses), 
CAPTCHA solvers/OCR packages, and phone numbers. 
Utilizing these components, abusers can create Web- 
based email accounts, the primary building blocks for 
service abuse. The email accounts can be used to reg- 
ister accounts for a number of Web services, including 
Craigslist, Facebook, Twitter, Digg, etc. 

The abusers can then implement various monetiza- 
tion schemes with the accounts, most of them involving 
“spamming”. The most direct form of spamming utilizes 
the Web email accounts to send spam. Craigslist PVAs al- 
low abusers to post repeated, daily advertisements, mak- 
ing a retailer’s product consistently appear near the top 
of the search results. Abusers can use social network- 
ing accounts in several ways, the most direct involving 
the creation of social links (fan, friend, follower, etc.) for 
marketing purposes. 

The relationship between this ecosystem and SEO is 
subtle: the accounts on social networking sites can also 
be used for SEO purposes. For example, abusers may 
spam blogs with comments that link to a Web page to obtain
more backlinks for the site. Abusers may also submit
links to social bookmarking sites, or utilize forum
accounts to create posts containing links (most often in 
the signature field). Many of these SEO jobs require con- 
tent, either in the form of articles, or actual content to in- 
clude in blog comments or forum posts. Lastly, abusers 
can also directly purchase backlinks on sites. 


7 Conclusion 


This paper demonstrates how web service abuse can be 
augmented by the use of low-cost freelance labor. Seven 
years of historical data have allowed us to collect infor- 
mation on abuse-related work on freelancer.com, one of 
the largest online websites offering piecework labor out- 
sourcing. Potential employers offered jobs such as link 
building on social network sites, mass email account cre- 
ation, and tasks related to search engine optimization. In 
addition, we found that the demand for freelancers to fill 
these jobs is being matched by an increase in the number 
of freelancers around the world who will compete for the 
work. 

Freelancer.com, and other sites that offer freelance 
jobs and employment are prime sources of new types 
of service abuse. The willingness of many freelancers to 
take part in these schemes allow those who offer the jobs 
to quickly ascertain new schemes and their success rate; 
if they are judged to be profitable, the jobs quickly be- 
come a staple income for the willing freelancer and thus, 
the employer. Services developed by experts to ensure 
the security of websites, such as CAPTCHA technology, 
are now targeted by employers who hire freelancers to 
break encoding and circumvent the site’s security mea- 
sures. These trends point to the need for anti-abuse for- 
tifications that will defend against attackers who have a 
workforce of virtually unlimited knowledge at an inexpensive
price.¹⁰


¹⁰The conclusion of this paper is an example of article rewriting:
modifying text to pass plagiarism detection systems like CopyScape, 
commonly as a means of producing high-quality SEO content. The 
original text, given to the freelancer, is given below: 
A Interesting Jobs


This appendix includes representative real jobs posted to 
Freelancer from all the job groups. These examples provide 
context and help to clarify the various legitimate and dirty job 
categories. 


A.1 Legitimate 


Private. project has already be awarded to <...>. thanks 


Legitimate Miscellaneous. I have a simple document for 
translation from Dutch to English. Those who are available for 
immediate start and freelancers only apply. 


A.2 Accounts 


Human CAPTCHA Solving. —PixProfit.com is the portal 
for data-typist. We’re looking for individuals or team of 
data-entering workers. We’ll pay from $1 for 1000 correctly 
typed images. 


Phone Verified Accounts. We are looking for a reliable 
provider of new CL Phone Verified Accounts(PVA). Will be 
buying up to 1000-2000/month. Willing to pay no more than 
$2.00/PVA Or best offer. 


A.3 SEO 


SEO Content Generation. 

I need 20 articles written about penis enlargement and 40 
articles written about male enhancement. The total is 60 
articles with the following requirements. Your writing must be 
your own original work (no article spinning). Length 500-600 
words per article. Written in excellent english with perfect 
grammar. Keyword density of 2%. 


SEO Spinning Article. I am looking for native content 
providers to provide me articles with spinner syntax. 
Something like this : {Deciding||Determining} in what
{type||kind||sort} of credit card to {apply||go for||lend 
oneself||put on||employ} for {depends||counts||reckons} on 
your {past||previous||recent|| former} credit 

{history ||account||report||theme}. Providers without prior 
spinning knowledge, Please don’t bid. I will pay 1.5 USD per 
spun article to start with only through Paypal. 


Link Building/Grey Hat. I am looking to outsource large 
numbers of blog commenting. Quality blog commenter 
needed. Can provide 1000 comments per week upwards. 
This will be for a trial of 100-200 comments per week. 


Link Building/White Hat. 100 Gambling Links from 
related PR 4 or higher pages. All on different sites and servers 
Requirements: No link farms, link-exchange programs, No 
black hat links or Tricks. 


SEO Miscellaneous. keyword : trader joes 

website : will mention via message 

SE: google.com 

i wan’t my website rank 1 in google.com. If interested pls send 
detail what is your skill to get this website top on google.com 


A.4 Spamming 


Human Oriented Postings. I need per day 2K Classified 
Ad Posting for my site I willings to pay for it $100. Per ad 
$0.05 


A.5 OSN Linking 


Create Social Networking Links. I am lonely I want to 
give my facebook account details to someone and have them 
populate it with 5000 English speaking friends help me please. 


A.6 Miscellaneous


Abuse Tools. The first tool necessary is Micro Niche
Finder. You will need this to do keyword research, and select
keywords based on our requirements. The tool will also allow
you to see which keywords have .com, .org, or .net domains
available. Once the available domains have been determined,
we will review your picks, and purchase them after approval.
Once purchased, you will need to create articles for each page,
and install the necessary wordpress theme and plugins. Once
this is complete, you will need to run SE Nuke or Evo II for
each site, at least 4 times per month.


Academic Fraud. For this project, you will put together
several techniques and concepts learned in CS <deleted > and 
some new techniques to make an application that searches a 
large database of people which we will call a Personal 
Information Manager (PIM), even though it only contains a 
few fields, and even fewer advanced functions. This project 
creates a simple program that allows people to enter names or 
email addresses and check whether they are found in the PIM. 


Account Creation Tools. Hey all! I’m in need of US 
telephone numbers with call forwarding for CL PVA creation. 
Please quote your rate. Bids lower or equal to $1 will be given 
higher priority. 

Other Malicious. Hello, I have a small sized EXE file of 
40KB and I need someone who can build a script who will 
DOWNLOAD AND EXECUTE the EXE file 
AUTOMATICALLY. What I mean by automatically? By
entering a single URL in the browser. 

Here is a PERFECT example: http://www.<deleted>.com. In 
the example above the EXE file is EXECUTED even when 
you click on CANCEL in the javascript prompt screen. 


B Interesting Bids 


This appendix includes representative real bids received from 
Freelancer workers from some abuse job groups. These bids 
shed light into the various tools and techniques used by 
workers to circumvent Web security mechanisms. Also, the 
bids provide some insight into worker demographics. 


B.1 Accounts 


Account Creation. 1 Account create on 1 ip, 
Cookies/Cache is cleared after every account automatically. 
All accounts are created using real human names. We have the 
ability to provide accounts as per your required format. 
We created those account with this requirement as below: 

1) All Gmail accounts created with unique US IP Addresses 2) 
All Gmail accounts created separate/unique passwords 3) All 
accounts created a prefix with names &/or words. Preferably 
no numbers 4) All accounts to have random First and Last 
names assigned. 5) All passwords have minimum of 8 
characters and preferably alpha-numeric 


B.2 SEO


SEO Content Generation. Hi!I am <deleted>. I am
currently a stay at home mom with 9 month old daughter so I 
currently have free time throughout the day. I can write quality 
articles/blogs, academic research papers and LSI/SEO written 
content of any nature.These articles are put through 
Copyscape premium dupe test before submission. Also find 
attached a sample News article I did for a local News paper.I 
assure you that your articles will be written in the most 
professional manner possible. I charge $1 per 100 word.I look 
forward to working with you.Take care 





B.3 Spamming


Create Social Networking Links. Techniques:(100% white 
hat) 1. Following people manually: Twitter let us follow 500 
people in a day and maximum 2000 follow using one account. 
So i found a nice technique by which i am able to make 1000 
follower. That is 

#First follow huge people manually up to 500 using an 
account similar to your account and after following 500 i will 
receive a massage, “You have cross the hourly limit . You cant 
follow now”. Then i will use another account to follow 
targeted follower up to 500... 


B.4 OSN Linking 


Human Oriented Postings. I am experienced with the CL 
posting .Now i am working use for daily posting (RDSL With 
AT@T Line ,CLAD Soft, Ip rental,Proxy,AOL,US hide IP, 
line with Logmein soft Or,Team Viewer & go to my PC), We 
have so much experience a team for all adds posting site such 
classified site) also have all requirements which need your 
project done. 
Abstract


Modern spam is ultimately driven by product sales: 
goods purchased by customers online. However, while 
this model is easy to state in the abstract, our understanding
of the concrete business environment (how
many orders, of what kind, from which customers, for
how much) is poor at best. This situation is unsurprising
since such sellers typically operate under questionable
legal footing, with “ground truth” data rarely available
to the public. However, absent quantifiable empirical
data, “guesstimates” operate unchecked and can distort
both policy making and our choice of appropriate
interventions. In this paper, we describe two inference
techniques for peering inside the business operations
of spam-advertised enterprises: purchase pair and
basket inference. Using these, we provide informed esti- 
mates on order volumes, product sales distribution, cus- 
tomer makeup and total revenues for a range of spam- 
advertised programs. 


1 Introduction 


A large number of Internet scams are “advertising- 
based”; that is, their goal is to convince potential cus- 
tomers to purchase a product or service, typically via 
some broad-based advertising medium.¹ In turn, this activity
mobilizes and helps fund a broad array of technical
capabilities, including botnet-based distribution, fast flux 
name service, and bulletproof hosting. However, while 
these same technical aspects enjoy a great deal of atten- 
tion from the security community, there is considerably 
less information quantifying the underlying economic 
engine that drives this ecosystem. Absent grounded em- 
pirical data, it is challenging to reconcile revenue “esti- 
mates” that can range from $2M/day for one spam bot- 
net [1], to analyses suggesting that spammers make little 


¹Unauthorized Internet advertising includes email spam, black hat
search-engine optimization [26], blog spam [21], Twitter spam [4], fo- 
rum spam, and comment spam. Hereafter we refer to these myriad ad- 
vertising vectors simply as spam. 
money at all [6]. This situation has the potential to distort
policy and investment decisions that are otherwise driven 
by intuition rather than evidence. 

In this paper we make two contributions to improving 
this state of affairs using measurement-based methods to 
estimate: 


• Order volume. We describe a general technique,
purchase pair, for estimating the number of orders
received (and hence revenue) via on-line store order 
numbering. We use this approach to establish rough, 
but well-founded, monthly order volume estimates 
for many of the leading “affiliate programs” selling 
counterfeit pharmaceuticals and software. 


• Purchasing behavior. We show how we can use
third-party image hosting data to infer the contents 
of customer “baskets” and hence characterize pur- 
chasing behavior. We apply this technique to a lead- 
ing spamvertized pharmaceutical program and iden- 
tify both the nature of these purchases and their re- 
lation to the geographic distribution of the customer 
base. 


In each case, our real contribution is less in the par- 
ticular techniques—which an adversary could easily de- 
feat should they seek to do so—but rather in the data that 
we used them to gather. In particular, we document that 
seven leading counterfeit pharmacies together have a to- 
tal monthly order volume in excess of 82,000, while three 
counterfeit software stores process over 37,000 orders in 
the same time. 

On the demand side, as expected, we find that most 
pharmaceuticals selected for purchase are in the “male- 
enhancement” category (primarily Viagra and other ED 
medications comprising 60 distinct items). However, 
such drugs constitute only 62% of the total, and we doc- 
ument that this demand distribution has quite a long tail; 
user shopping carts contain 289 distinct products, includ- 
ing surprising categories such as anti-cancer medications 
(Arimidex and Gleevec), anti-schizophrenia drugs (Seroquel),
and asthma medications (Advair and Ventolin).
We also discover significant differences in the purchas- 
ing habits of U.S. and non-U.S. customers. 

Combining these measurements, we synthesize overall 
revenue estimates for each program, which can be well 
in excess of $1M per month for a single enterprise. To 
the best of our knowledge, ours is the first empirical data 
set of its kind, as well as the first to provide insight into
the size of the spam-advertised goods market and
corresponding customer purchasing behavior.

We structure the remainder of this paper as follows. 
In § 2 we motivate the need for such research, explain 
the limitations of existing data, and provide background 
about how the spam-advertised business model works to- 
day. We discuss our purchase pair technique in § 3, val- 
idating our technique for internal consistency and then 
presenting order volume estimates across seven of the 
top pharmaceutical affiliate programs and three counter- 
feit software programs. We then explore the customer dy- 
namics for one particular pharmaceutical program, Eva- 
Pharmacy, in § 4. We explain how to use image log data 
to identify customer purchases and then document how, 
where and when the EvaPharmacy customer base places 
its orders. We summarize our findings in § 5, devising 
estimates of revenue and comparing them with external 
validation. We conclude with a discussion about the im- 
plications of our findings in § 6. 


2 Background 


The security community is at once awash in the tech- 
nical detail of new threats—the precise nature of a new 
vulnerability or the systematic analysis of a new botnet’s 
command and control protocol—yet somewhat deficient 
in analyzing the economic processes that underlie these 
activities. In fairness, it is difficult to produce such anal- 
yses; there are innate operational complexities in acquir- 
ing such economic data and inherent uncertainties when 
reasoning about underground activities whose true scope 
is rarely visible directly. 

However, absent a rigorous treatment, the resulting in- 
formation vacuum is all too easily filled with opinion, 
which in turn can morph into “fact” over time. Though 
pervasive, this problem seemingly reached its zenith in 
the 2005 claim by US Treasury Department consultant 
Valerie McNiven that cybercrime revenue exceeded that 
of the drug trade (over $100 billion at the time) [11]. 
This claim was frequently repeated by members of the 
security industry, growing in size each year, ultimately 
reaching its peak in 2009 with written Congressional tes- 
timony by AT&T’s chief security officer stating that cy- 
bercrime reaped “more than $1 trillion annually in illicit 
profits” [23], a figure well in excess of the entire software
industry and almost twice the GDP of Germany.
Nay-sayers are similarly limited in their empirical evi- 
dence. Perhaps best known in this group are Herley and 
Florencio, who argue that a variety of cybercrimes are 
generally unprofitable. However, lacking empirical data, 
they are forced to use an economic meta-analysis to make 
their case [5, 6, 7]. 

Unfortunately, the answer to such questions matters. 
Without an “evidence basis”, policy and investment de- 
cisions are easily distorted along influence lines, either 
over-reacting to small problems or under-appreciating 
the scope of grave ones. 


2.1 Estimating spam revenue and demand 


In this paper we examine only a small subset of such 
activity: spam-advertised counterfeit pharmacies and, to 
a lesser extent, counterfeit software stores. However, 
even here public estimates can vary widely. In 2005, 
one consultancy estimated that Russian spammers earned 
roughly US$2-3M per year [18]. However, in a 2008 
interview, one IBM representative claimed that a single 
spamming botnet was earning close to $2M per day [1]. 
Our previous work studied the same botnet empirically, 
leading to an estimate of daily revenue of up to $9,500, 
extrapolating to $3.5M per year [10]. Most recently, a re- 
port by the Russian Association of Electronic Communi- 
cation (RAEC) estimated that Russian spammers earned 
3.7 billion rubles (roughly $125 million) in 2009 [12]. 

The demand side of this equation is even less well 
understood, relying almost entirely on opt-in phone or 
email polls. In 2004, the Business Software Alliance 
sponsored a Forrester Research poll to examine this 
question, finding that out of 6,000 respondents (spread 
evenly across the US, Canada, Germany, France, the UK 
and Brazil) 27% had purchased spam-advertised soft- 
ware and 13% had purchased spam-advertised pharma- 
ceuticals [3]. If such data were taken at face value, the US 
market size for spam-advertised pharmaceuticals would 
exceed 30 million customers. Similar studies, one by 
Marshal in 2008 and the other sponsored by the Mes- 
saging Anti-Abuse Working Group (MAAWG) in 2009, 
estimate that 29% and 12%, respectively, of Internet 
users had purchased goods or services advertised in spam 
email [8, 19]. 

In our previous work on empirically quantifying rev- 
enue for such activities, our measurements were only 
able to capture a few percent of orders for sites adver- 
tised by a single botnet serving a single affiliate program, 
GlavMed [10]. Here, we aim to significantly extend our 
understanding, with our results covering total order vol- 
ume for five of the six top pharmacy affiliate programs, 
and three of the top five counterfeit software affiliate pro- 
grams. Moreover, to the best of our knowledge our anal- 
ysis of EvaPharmacy is the first measurement-based ex- 
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amination of customer purchasing behavior, the demand 
component of the counterfeit pharmacy ecosystem. 


2.2 How spam-advertised sites work 


To provide context for the analysis in this paper, we first 
describe how modern spam is monetized and the ecosys- 
tem that supports it. 

Today, spam of all kinds represents an outsourced mar- 
keting operation in service to an underlying sales activ- 
ity. At the core are “affiliate programs” that provide retail 
content (e.g., storefront templates and site code) as well 
as back-end services (e.g., payment processing, fulfill- 
ment and customer support) to a set of client affiliates. 
Affiliates in turn are paid on a commission basis (typ- 
ically 30-50% in the pharmaceutical market) for each 
sale they bring in via whatever advertising vector they 
are able to harness effectively. This dynamic is well de- 
scribed in Samosseiko’s “Partnerka” paper [22] and also 
in our recent work studying the spam value chain [16]. 

Thus, while an affiliate has a responsibility to attract 
customers and host their shopping experience (which includes
maintaining the contents of their “shopping cart”),
once a customer decides to “check out” the affiliate hands
the process over to the operators of the affiliate program.²
Consequently, we would expect to find the order process- 
ing service shared across all affiliates of a particular pro- 
gram, regardless of the means used to attract customers. 
Indeed, as discussed below, our measurements of purchases
from different members of the same affiliate program confirm
that the order numbers associated with the purchases
come from a common pool. This finding is critical for our
study because it means that side-effects in the order processing
phase reflect the actions of all sales activity for
an entire program, rather than just the sales of a single 
member. 

On the back end, order processing consists of sev- 
eral steps: authorization, settlement, fulfillment, and cus- 
tomer service. Authorization is the process by which 
the merchant confirms, through the appropriate payment 
card association (e.g., Visa, MasterCard, American Ex- 
press, Japan Credit Bureau, etc.), that the customer has 
sufficient funds. For the most common payment cards 
(Visa/MC), this process consists of contacting the cus- 
tomer’s issuing bank, ensuring that the card is valid and 
the customer possesses sufficient funds, and placing a 
lien on the current credit balance. Once the good or ser- 
vice is ready for delivery, the merchant can then execute 
a settlement transaction that actualizes this lien, transfer- 
ring money to the merchant’s bank. Finally, fulfillment 
comprises packaging and delivery (e.g., shipping drugs 


²This transfer typically takes the form of a redirection to a payment
gateway site (with the affiliate’s identity encoded in the request),
although some sites also support a proxy mode so the customer can 
appear to remain at the same Web site. 
directly from a foreign supplier or providing a Web site
and password for downloading software). For our study, 
however, the key leverage lies in customer service. To 
support customer service, payment sites generate indi- 
vidual order numbers to share with the customer. In the 
next section, we describe how we can use the details of 
this process to infer the overall transaction rate, and ulti- 
mately revenue, of an entire affiliate program. 


3 Order volume 


Underlying our purchase pair measurement approach is 
a model of how affiliate programs handle transactions, 
and, in particular, how they assign order numbers. 


3.1 Basic idea 


Upon placing an order, most affiliate programs provide a 
confirmation page that includes an “order number” (typ- 
ically numeric, or at least having a clear numeric compo- 
nent) that uniquely specifies the customer’s transaction. 
For purchases where an order number does not appear 
on the confirmation page, the seller can provide one in 
a confirmation email (the common case), or make one 
available via login to the seller’s Web site. The order 
number allows the customer to specify the particular pur- 
chase in any subsequent emails, when using customer 
support Web sites, or when contacting online support 
via email, IM or live Web chat. For the purchases we 
made, we found that the seller generally provides the or- 
der number before the authorization step (indeed, even 
before merchant-side fraud checks such as Address Ver- 
ification Service), although purely local checks such as 
Luhn digit validation are frequently performed first. Ac- 
cordingly, we can consider the creation of an order num- 
ber only as evidence that a customer attempted an order, 
not that it successfully concluded. Thus, the estimates we 
form in this work reflect an upper bound on the transaction
rate, including transactions declined during authorization
or settlement.³
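
The Luhn check mentioned above is a purely local checksum over the card number's digits, which is why a site can run it before contacting any payment processor. The following Python sketch shows the standard algorithm; the function name is our own, and nothing here is taken from the code of any particular storefront.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum.

    Non-digit characters (spaces, hyphens) are ignored.
    """
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk from the rightmost digit; double every second digit and
    # subtract 9 when the doubled value exceeds 9.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0
```

A number that fails this test can be rejected without any network traffic, so passing it says nothing about whether the card exists or has funds.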

The most important property for such order numbers 
is their uniqueness; that each customer order is assigned 
a singular number that is distinguished over time with- 
out the possibility of aliasing. While there are a vast 
number of ways such uniqueness could be implemented 
(e.g., a pseudo-random permutation function), the easi- 
est approach by far is to simply increment a global vari- 
able for each new order. Indeed, the serendipitous ob- 
servation that motivated our study was that multiple pur- 
chases made from the same affiliate program produced 


³In 2008, Visa documented that card-not-present transactions such
as e-commerce had an issuer decline rate of 14% system-wide [25]. In 
addition, it seems likely that some orders are declined at the merchant’s 
processor due to purely local fraud checks (such as per-card or per- 
address velocity checks or disparities between IP address geolocation 
versus shipping address). 
order numbers that appeared to monotonically increase
over time. Observing the monotonic nature of this se- 
quence, we hypothesized that order number allocation is 
implemented by serializing access to a single global vari- 
able that is incremented each time an order is made; we 
call this the sequential update hypothesis. To assess this 
hypothesis, we examined source code for over a dozen 
common e-commerce platforms (e.g., Magento, X-cart, 
Ubercart, and Zen-cart [17, 24, 27, 28]), finding ubiqui- 
tous use of such a counter, typically using an SQL auto- 
update field, but sometimes embodied explicitly in code. 
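
As a minimal illustration of such a database-maintained counter, the Python sketch below uses the standard `sqlite3` module with an `AUTOINCREMENT` primary key. The table and column names are hypothetical and are not taken from any of the platforms named above; the point is only that each insert receives the next integer from a single shared sequence.

```python
import sqlite3

def place_orders(n):
    """Insert n orders into a fresh in-memory table and return their ids."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders ("
        " order_id INTEGER PRIMARY KEY AUTOINCREMENT,"  # global sequential counter
        " item TEXT NOT NULL)"
    )
    ids = []
    for i in range(n):
        cur = conn.execute("INSERT INTO orders (item) VALUES (?)", (f"item-{i}",))
        ids.append(cur.lastrowid)  # id assigned by the database, not by our code
    conn.close()
    return ids
```

Under this scheme every storefront writing to the same table draws from the same sequence, which is the behavior the sequential update hypothesis assumes.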

Given use of such a global sequential counter, the 
difference between the numbers associated with orders 
placed at two points in time reflects the total number of 
orders placed during the intervening time period. Thus, 
from any pair of purchases we can extract a measure- 
ment of the total transaction volume for the interval of 
time between them, even though we cannot directly wit- 
ness those intervening transactions. Figure 1 illustrates 
the methodology using a concrete example. This obser- 
vation is similar in flavor to the analysis used in blind/idle 
port scanning (there the sequential increment of the IP 
identification field allows inference of the presence of 
intervening transmissions) [2]. It then appears plausible 
that this same purchase-pair approach might work across 
a broad range of spam-advertised programs, a possibility 
that we explore more thoroughly next. 
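
The purchase-pair arithmetic described above can be sketched in a few lines of Python. This assumes the sequential update hypothesis holds (one global counter, incremented once per order attempt); the function name and the example timestamps and order numbers are invented for illustration and are not measurements from this study.

```python
from datetime import datetime

def purchase_pair_rate(first, second):
    """Estimate orders per day from two (timestamp, order_number) purchases.

    Assumes order numbers come from a single counter that is
    incremented once per order attempt.
    """
    (t1, n1), (t2, n2) = sorted([first, second])  # order the pair by time
    if n2 < n1:
        raise ValueError("order numbers are not monotonically increasing")
    days = (t2 - t1).total_seconds() / 86400
    if days <= 0:
        raise ValueError("purchases must be separated in time")
    orders = n2 - n1  # orders placed after the first purchase, up to and including the second
    return orders / days

# Hypothetical pair of purchases ten days apart.
a = (datetime(2011, 1, 10), 100000)
b = (datetime(2011, 1, 20), 101500)
rate = purchase_pair_rate(a, b)  # 1500 orders over 10 days
```

Because the order number is issued before authorization, a rate computed this way counts attempted orders and is therefore an upper bound on completed sales.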


3.2 Data collection 


Evaluating this approach requires that we first identify
which sites advertise which affiliate programs, and then 
place repeated purchases from each. We describe how we 
gathered each of these data sets in this section. 


Program data 


In prior work, we developed a URL crawler to follow 
the embedded links contained in real-time feeds of email 
spam (provided by a broad range of third-party anti- 
spam partners) [16]. The crawler traverses any redirec- 
tion pages and then fetches and renders the resulting page 
in a live browser. We further developed a set of “page 
classifiers” that identify the type of good being adver- 
tised by analyzing the site content, and, in most cases, 
the particular affiliate program being promoted. We de- 
veloped specific classifiers for over 20 of the top phar- 
maceutical programs (comprising virtually all sites ad- 
vertised in pharmaceutical spam), along with the four 
most aggressively spam-advertised counterfeit software 
programs. 

After placing multiple test orders with nine of these 
pharmaceutical programs, we identified seven with 
strictly incrementing order numbers.⁴ Five of these (Rx-Promotion,


⁴Of the two programs that we did not select, ZedCash used several
different strictly increasing order number subspaces that would compli-
Pharmacy Express (aka Mailien), GlavMed,
Online Pharmacy and EvaPharmacy) together consti- 
tuted two-thirds of all sites advertised in the roughly 
350 million distinct pharmaceutical spam URLs we ob- 
served over three months in late 2010. We found the 
sixth, 33drugs (aka DrugRevenue), and seventh, 4RX, 
less prevalent in email spam URLs, but they appear to 
be well advertised via search engine optimization (SEO) 
techniques [15]. We did a similar analysis of counterfeit 
software programs, finding three (Royal Software, Eu- 
roSoft, and SoftSales) with the appropriate order-number 
signature. While counterfeit software is less prevalent in 
total spam volume, these three programs constitute over 
97% of such sites advertised to our spam collection appa- 
ratus during the same 3-month period. For the remainder 
of this paper we focus exclusively on these ten programs, 
although it appears plausible that the same technique will 
prove applicable to many smaller programs, and also to 
programs in other such markets (e.g., gambling, fake an- 
tivirus, adult). 


Order data 


We collected order data in two ways: actively, via our 
own purchases, and opportunistically, based on the purchases 
of others. First and foremost are our own purchases, 
which we conducted in two phases. The first 
phase arose during a previous study, during which we 
executed a small number of test purchases from numer- 
ous affiliate programs in January and November of 2010 
using retail Visa gift cards. Of these, 46 targeted the ten 
programs under study in this paper. The second phase 
(comprising the bulk of our active measurements) re- 
flects a regimen of purchases made over three weeks in 
January and February 2011 focused specifically on the 
ten programs we identified above. 

When placing these orders, we used multiple distinct 
URLs leading to each program (as identified by our page 
classifiers). The goal of this procedure was to maximize 
the likelihood of using distinct affiliates to place pur- 
chases in order to provide an opportunity to determine 
whether different affiliates of a given program make use 
of different order-processing services. 

Successfully placing orders had its own set of op- 
erational challenges [9]. Except where noted, we per- 
formed all of our purchases using prepaid Visa credit 
cards provided to us in partnership with a specialty is- 
suer, and funded to cover the full amount of each trans- 
action. We used a distinct card for each purchase and 
went to considerable lengths to emulate real customers. 
We used valid names and associated residential shipping 
addresses, placed orders from a range of geographically 
proximate IP addresses, and provided a unique email 
address for each order. We used five contact phone numbers 
for order confirmation, three from Google Voice and two 
via prepaid cell phones, with all inbound calls routed to 
the prepaid cell phones. In a few instances we found it 
necessary to place orders from IP addresses closely 
geolocated to the vicinity of the billing address for a given 
card, as the fraud check process for one affiliate program 
(EuroSoft) was sensitive to this feature. Another program 
(Royal Software) would only accept one order per IP address, 
requiring IP address diversity as well. 


cate our analysis and decrease accuracy, while World Pharmacy order 
numbers appeared to be the concatenation of a small value with the 
current Unix timestamp, which would thwart our analysis altogether. 


Figure 1: How the purchase pair technique works. In this hypothetical situation, two measurement purchases are made that bracket 
some number of intervening purchases made by real customers. Because order number allocation is implemented by a serialized 
sequential increment, the difference in the order numbers between measurement purchases, N = 23, corresponds to the total 
number of orders processed by the affiliate program in the intervening time. 

In total we placed 156 such orders. We scheduled them 
both periodically over a three-week period and 
in patterns designed to help elucidate more detail about 
transaction volume and to test for internal consistency, as 
discussed below. 

Finally, in addition to the raw data from our own 
purchase records, we were able to capture several pur- 
chase order numbers via forum scraping. This opportu- 
nity arose because affiliate programs typically sponsor 
online forums that establish a community among their 
affiliates and provide a channel for distributing operational 
information (e.g., changes in software or name 
servers), sharing experiences (e.g., which registrars will 
tolerate domains used to host pharmaceutical stores), and 
raising complaints or questions. One forum in particular, 
for the GlavMed program, included an extended “com- 
plaint” thread in which individual affiliates complained 
about orders that had not yet cleared payment process- 
ing (important to them since affiliates are only paid for 
each settled transaction that they deliver). These affiliates 
chose to document their complaints by listing the order 
number they were waiting for, which we determined was 
in precisely the same format and numeric range as the 
order numbers presented to purchasers. By mining this 
forum we obtained 122 numbers for past orders, including 
orders dating back to 2008. 


Affiliate program     Phase 1   Phase 2 
Rx-Promotion                7        27 
Pharmacy Express            3         9 
GlavMed                    12        14 
Online Pharmacy             5        16 
EvaPharmacy                 7        16 
33drugs                     4        16 
4RX                         1        13 
EuroSoft                    3        25 
Royal Software              2         9 
SoftSales                   2        11 


Table 1: Active orders placed to sites of each affiliate program 
in the two different time phases of our study. In addition, we op- 
portunistically gathered 122 orders for GlavMed covering the 
period between 2/08 and 1/11. 


Note that this data contains an innate time bias since 
the date of complaint inevitably came a while later than 
the time of purchase (unlike our own purchases). For this 
reason, we identify opportunistically gathered points dis- 
tinctly when analyzing the data. We will see below that 
the bias proves to be relatively minor. 

We summarize the total data set in Table 1. It includes 
order numbers from 202 active purchases and 122 oppor- 
tunistically gathered data points. 


3.3 Consistency 


While our initial observations of monotonicity are quite 
suggestive, we need to consider other possible explana- 
tions and confounding factors as well. Here we evaluate 
the data for internal consistency—the degree to which 
the data appears best explained by the sequential update 
hypothesis rather than other plausible explanations. At 
the end of the paper we also consider the issue of ex- 
ternal consistency using “ground truth” revenue data for 
one program. 


Sequential update 


The fundamental premise underlying our purchase-pair 
technique is that order numbers increment sequentially 
for each attempted order. The monotone sequences that 
we observe accord with this hypothesis, but could arise 
from other mechanisms. Alternate interpretations in- 
clude that updates are monotone but not sequential (e.g., 
incrementing the order number by a small, varying num- 
ber for each order) or that order numbers are derived 
from timestamps (i.e., that each order number is just 
a normalized representation of the time of purchase, 
and does not reflect the number of distinct purchase at- 
tempts). 

To test these hypotheses, we executed back-to-back 
orders (i.e., within 5-10 seconds of one another) for 
each of the programs under study. We performed this 
measurement at least twice for all programs (except- 
ing EvaPharmacy, which temporarily stopped operation 
during our study). For eight of the programs, every 
measurement pair produced a sequential increment. The 
GlavMed program also produced sequential increments, 
but we observed one measurement for which the order 
number incremented by two, likely simply due to an in- 
tervening order out of our control. Finally, we observed 
no sequential updates for Rx-Promotion even with repeated 
back-to-back purchase attempts. However, upon 
further examination of 35 purchases, we noticed that order 
numbers for this program are always odd; for whatever 
reason, the Rx-Promotion order processing system 
increments the order number by two for each order attempt. 
Adjusting for this deviation, our experiments find 
that on finer time scales, every affiliate program be- 
haves consistently with the sequential update hypothe- 
sis. 
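
The increment check described above can be sketched in a few lines of Python. This is a minimal illustration rather than the tooling used in the study: the function names and the sample order numbers are hypothetical, and the greatest-common-divisor heuristic is simply one assumed way to detect a fixed per-order increment, such as the step of two observed for Rx-Promotion.

```python
from functools import reduce
from math import gcd


def infer_stride(order_numbers):
    """Estimate the per-order increment from observed order numbers.

    Returns the GCD of the gaps between successive observations.
    The true increment always divides this value, so with only a
    few samples the result may overestimate the real stride.
    """
    gaps = [b - a for a, b in zip(order_numbers, order_numbers[1:])]
    if not gaps or any(g <= 0 for g in gaps):
        raise ValueError("need at least two strictly increasing order numbers")
    return reduce(gcd, gaps)


def orders_between(first, second, stride=1):
    """Orders allocated after the first measurement purchase, up to and
    including the second, given the per-order stride."""
    return (second - first) // stride


# Hypothetical observations in which every order number is odd.
samples = [1001, 1003, 1009, 1051]
stride = infer_stride(samples)
print(stride)                               # 2
print(orders_between(1001, 1051, stride))   # 25
```

With the default stride of one, `orders_between(1000, 1023)` gives 23, which matches the difference N = 23 in the hypothetical situation of Figure 1.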

We need however to consider an alternate hypothesis 
for this same behavior: that order numbers reflect nor- 
malized representations of timestamps, with each order 
implicitly serialized by the time at which it is received. 
This “clock” model does not appear plausible for fine- 
grained time scales. Our purchases made several seconds 
apart received sequential order numbers, which would re- 
quire use of a clock that advances at a somewhat peculiar 
rate—slowly enough to risk separate orders receiving the 
same number and violating the uniqueness property. 

A possible refinement to the clock model would be 
for a program to periodically allocate a block of order 
numbers to be used for the next T seconds (e.g., for 
T = 3,600) and, after that time period elapses, advance 
to the next available block. The use of such a hybrid 
approach would still enable us to analyze purchasing activity 
over fine-grained time scales, but it would tend toward 
misleading over-inflation of such activity on larger 
time scales, since we would be comparing values generated 
across gaps. 


Figure 2: Order numbers (y-axis) associated with each affiliate 
program versus the time of attempted purchase (x-axis). 


We test for whether the order numbers in our data fit 
with a clock model as follows. First, we consider the 
large-scale behavior of order numbers as seen across the 
different affiliate programs. Figure 2 plots for each pro- 
gram the order number associated with a purchase at- 
tempt made at a given time. We plot each of the 10 af- 
filiate programs with a separate symbol (and varying 
shades, though we reuse a few for programs whose num- 
bers are far apart). In addition, we plot with black points 
the order numbers revealed in the GlavMed discussion 
forum. 

Three basic points stand out from the plot. First, all 
of the programs use order numbers distinct from the oth- 
ers. (We verified that neither of those closest together, 
33drugs and Royal Software, nor Pharmacy Express and 
SoftSales, overlap.) Thus, it is not the case that separate 
affiliate programs share unified order processing. 

Second, the programs nearly always exhibit mono- 
tonicity even across large time scales, ruling out the pos- 
sibility that some programs occasionally reset their coun- 
ters. (We discuss the outliers that manifest in the plot be- 
low.) 

Third, the GlavMed forum data is consistent with our 
own active purchases from GlavMed. In addition, the 
data for both has a clear downward concavity starting 
in 2009—inconsistent with use of clock-driven batches, 
but consistent with the sequential update hypothesis. As- 
suming that the data indeed reflects purchase activity, the 
downward concavity also indicates that the program has 
been losing customers, a finding consistent with main- 
stream news stories [13]. 

Figure 3: The amount of error (either in our measurement process 
or due to batching of order numbers) required for each 
measurement in 2011 to be consistent with the Null Hypothesis 
that order numbers are derived from a clock that advances at 
some steady rate. Note that the y-axis is truncated at ±24 hrs, 
though additional points lie outside this range. 


We lack such extensive data for the other programs, 
but can still assess their possible agreement with use 
of clock-driven batches, as follows. For each program, 
we consider the purchases made in 2011. We construct 
a least-squares linear fit between the order numbers of 
the purchases and the time at which we made them. If 
the order numbers come from clock-driven batches (the 
Null Hypothesis), then we would expect all of the 
points associated with our purchases to fall near the fitted 
line. Accordingly, for each point we compute how far we 
would have to move it along the x-axis so that it would 
coincide with the line for its program. If the Null Hypoth- 
esis is true, then this deviation in time reflects the error 
that must have arisen during our purchase measurement: 
either due to poor accuracy in our own time-keeping, or 
because of the granularity of the batches used by the pro- 
gram for generating order numbers. 
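
As an illustration of this residual computation, the following sketch fits a least-squares line to (time, order number) pairs and reports how far each point would have to move along the time axis to land on that line. It is a sketch under stated assumptions, not the analysis code behind Figure 3: the function names are hypothetical, the sample data is invented, and times are assumed to be expressed in hours.

```python
def fit_line(times, order_numbers):
    """Ordinary least-squares fit: order_number ~ slope * time + intercept."""
    n = len(times)
    mean_t = sum(times) / n
    mean_o = sum(order_numbers) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (o - mean_o) for t, o in zip(times, order_numbers))
    slope = sxy / sxx
    return slope, mean_o - slope * mean_t


def required_time_shifts(times, order_numbers):
    """For each purchase, the shift along the time axis needed to reach the
    fitted line (negative: the order number 'belongs' to an earlier time)."""
    slope, intercept = fit_line(times, order_numbers)
    return [(o - intercept) / slope - t for t, o in zip(times, order_numbers)]


# Invented purchases: hours since the first purchase, and order numbers.
hours = [0.0, 24.0, 48.0, 72.0, 96.0]
orders = [5000, 5230, 5490, 5610, 5900]
for h, shift in zip(hours, required_time_shifts(hours, orders)):
    print(f"purchase at t={h:5.1f} h needs a shift of {shift:+6.1f} h")
```

Under a clock-driven model these shifts should stay within the batch size plus any timekeeping error; shifts of many hours therefore argue against that model.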

Figure 3 plots this residual error for each affiliate pro- 
gram. For example, in the lower right we see a point for 
a 33drugs purchase made in early February 2011. If the 
Null Hypothesis holds, then the purchaser’s order num- 
ber reflects a value that should have appeared 18 hours 
earlier than when we observed it. That is, either we in- 
troduced an error of about 18 hours in recording the time 
of that purchase; or the program uses a batch-size of 18+ 
hours; or the Null Hypothesis fails to hold. 

For all ten of the affiliate programs, we find many purchases 
that require timing errors of many hours to maintain 
consistency with the Null Hypothesis. (Note that 
we restrict the y-axis to the range ±24 hr for legibility, 
although we find numerous points falling outside that 
range as well.) In addition, we do not discern any temporal 
patterns in the required errors, such as would be the 
case if the least-squares fit were perturbed by an outlier. 
Finally, if we extend the analysis out to November 2010 
(not shown), we find that the required error grows, sometimes 
to hundreds of hours, indicating that the discrepancy 
does not result from a large batch size such as T = 1 day. 
Given this evidence, we reject the Null Hypothesis that 
the order numbers derive from a clock-driven mecha- 
nism. We do however find the data consistent with the 
sequential update hypothesis, and so proceed from this 
point on the presumption that indeed the order numbers 
grow sequentially with each new purchase attempt. 


Payment independence 


We placed most of our orders using cards underwritten 
by Visa. We selected Visa because it is the dominant pay- 
ment method used by these affiliate programs (few accept 
MasterCard, and fewer still process American Express). 
However, it is conceivable that programs allocate distinct 
order number ranges for each distinct type of payment. If 
so, then our Visa-based orders would only witness a sub- 
set of the order numbers, leading us to underestimate the 
total volume of purchase transactions. To test this ques- 
tion, we acquired several prepaid MasterCard cards and 
placed orders at those programs that accept MasterCard 
(doing so excludes Rx-Promotion, GlavMed, 4RX and 
Online Pharmacy). In each case, we found that Visa pur- 
chases made directly before and after a MasterCard pur- 
chase produced order numbers that precisely bracketed 
the MasterCard order numbers as well. 


Outliers 


Out of the 324 samples in our dataset, we found a small 
number of outliers (six) that we discuss here. Almost all 
come from the GlavMed program. The outliers fall into 
two categories: two singleton outliers completely outside 
the normal order number range for the program, and one 
group of four internally consistent order numbers that 
were slightly outside the expected range, violating mono- 
tonicity. We discuss these in more detail here, as well as 
their possible explanations. 

The first singleton outlier was a purchase placed at a 
Web site that is clearly based on the SE2 engine built 
by GlavMed. However, the returned order number was 
close to 16000 when co-temporal orders from all other 
GlavMed sites returned orders closer to 1080000. The 
site differs in a number of key features, including a 
unique template not distributed in the standard package 
made available to GlavMed affiliates, a different support 
phone number, different product pricing, and purchases 
processed via a different acquiring bank than used by 
all other GlavMed purchases. Taken together, we believe 
this reflects a site that is simply using the SE2 engine, but 
is not in fact associated with the GlavMed operation.5 

The second outlier occurred in a very early (January 
2010) purchase from a Pharmacy Express affiliate, which 
returned an order number much higher than any seen in 
later purchases. We have no clear explanation for this in- 
congruity, and other key structural and payment features 
match, but we note that the order numbers returned in 
all subsequent Pharmacy Express transactions are only 
five digits long, and that over nine months pass between 
this initial outlier and all subsequent purchases. Conse- 
quently, we might reasonably explain the discrepancy by 
a decision to reset the order number space at some point 
between January and October. 

Finally, we find a group of four early GlavMed pur- 
chases whose order numbers are roughly the same mag- 
nitude, but occur out of sequence (i.e., given the rate of 
growth seen in the other GlavMed order numbers, these 
four are from a batch that will only be used sometime 
in 2013). These all occurred together in the last two 
weeks of January 2010. This small outlier group remains 
a mystery, and suggests either that GlavMed might main- 
tain a parallel order space for some affiliates, or that they 
reflect a “counterfeit” GlavMed operation. The remain- 
ing 21 GlavMed purchase samples, as well as the 122 op- 
portunistically gathered order numbers (occurring both 
before and after January 2010), all use consistent order 
numbering. 

While we cannot completely explain these few outliers, 
they represent less than 2% of our dataset. 
We also have found no unexplained instances within the 
last 12 months. We remove these six data points in the 
remainder of our analysis. 


3.4 Order rates 


Under these assumptions, we can now estimate the rate 
of orders seen by each enterprise. Figure 4 plots the 2011 
data points for each of the 10 programs. We also plot 
the least-squares linear fit as well as the slope 
parameter of this line, corresponding to the number of 
orders received per day on average. During this time period, 
daily order rates for pharmacy programs vary from 
a low of 227 for Rx-Promotion (recall that their order 
IDs increment by two for each order) up to a high of 887 
for EvaPharmacy (software programs range between 49 
and 749). Together, these reflect a monthly volume of 
over 82,000 pharmaceutical orders and over 37,000 soft- 
ware orders. Again, these numbers reflect upper bounds 
on completed orders, since undoubtedly some fraction of 
these attempted orders are declined; however, it seems 
clear that order volume is substantial. 
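
The conversion from a fitted slope to a daily order rate can be sketched as follows. This is only an illustration: the function name and the data are hypothetical, times are assumed to be in days, and the `stride` parameter is an assumed way to account for programs whose order numbers advance by more than one per order.

```python
def daily_order_rate(days, order_numbers, stride=1):
    """Least-squares slope of order number versus time in days,
    divided by the per-order increment."""
    n = len(days)
    mean_d = sum(days) / n
    mean_o = sum(order_numbers) / n
    sxx = sum((d - mean_d) ** 2 for d in days)
    sxy = sum((d - mean_d) * (o - mean_o) for d, o in zip(days, order_numbers))
    return (sxy / sxx) / stride


# Synthetic order numbers that advance by 400 per day with a stride of 2.
days = [0, 1, 2, 3]
numbers = [1, 401, 801, 1201]
rate = daily_order_rate(days, numbers, stride=2)
print(round(rate))        # 200
print(round(rate * 30))   # 6000
```

Multiplying the daily rate by 30 gives a rough monthly volume; as in the text, any such figure is an upper bound on completed orders because declined attempts also consume order numbers.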


5 We have found third parties contracting for custom GlavMed templates 
on popular “freelancer” sites, giving reason to believe that independent 
innovation exists around the SE2 engine created by GlavMed. 


We also note that while order volume is quite consistent 
across January and February, there are significant 
fall-offs for some programs when compared to the data 
gathered earlier. For example, during 2010, the average 
number of Rx-Promotion orders per day was 385, 70% 
greater than during the first two months of 2011. Similarly, 
2011 GlavMed orders are off roughly 20% from 
their 2010 pace, and EvaPharmacy saw a similar de- 
cline as compared to October and November of that year. 
Other programs changed little and maintained a stable 
level of activity. 


4 Purchasing behavior 


While the previous analysis demonstrates that pharma- 
ceutical affiliate programs are receiving a significant vol- 
ume of orders, it reveals little about the source of these 
orders or their contents. In this section, we use an oppor- 
tunistic analysis of found server log data to explore these 
issues for one such affiliate program. 


4.1 EvaPharmacy image hosting 


In particular, we examine EvaPharmacy, a “top 5” spam-advertised 
pharmacy affiliate program.6 In monitoring 
EvaPharmacy sites we observed that roughly two thirds 
“outsourced” image hosting to compromised third-party 
servers (typically functioning Linux-based Web servers). 
This behavior was readily identifiable because visits to 
such sites produced HTML code in which each image 
load was redirected to another server—addressed via raw 
IP address—at port 8080. 

We contacted the victim of one such infection and they 
were able to share IDS log data in support of this study. 
In particular, our dataset includes a log of HTTP request 
streams for a compromised image hosting server that 
was widely used by EvaPharmacy sites over five days 
in August of 2010. While the raw IP addresses in our 
dataset have been anonymized (consistently), they have 
first been geolocated (using MaxMind) and these geo- 
graphic coordinates are available to us. Thus, we have 
city-level source identifiability as well as the contents of 
HTTP logs (including timestamp, object requested, and 
referrer). 

Through repeated experimentation with live Eva- 
Pharmacy sites, we inferred that the site “engine” can use 
dynamic HTML rewriting (similar to Akamai) to rewrite 
embedded image links on a per visit basis. On a new 
visit (tracked via a cookie), the server selects a set of 
five compromised hosts and assigns these (apparently in 
a quasi-random fashion) to each embedded image link 
served. During the five-day period covering our log data, 
our crawler observed 31 distinct image servers in use. 


6 Our page classifiers [16] identified EvaPharmacy in over 8% of 
pharmacy sites found in spam-advertised URLs over three months, with 
affiliates driving traffic to over 11,000 distinct domains. 


Figure 4: Collected data points and best fit slope showing the inferred order rate for ten different spam-advertised affiliate programs. 
Order numbers are zero-normalized and the vertical scale of each plot is identical. 


However, our particular server was apparently dispropor- 
tionately popular, as it appears in 31% of all contempo- 
raneous visits made by our URL crawler (perhaps due 
to its particularly good connectivity). In turn, each im- 
age server hosts an nginx Web proxy able to serve the 
entirety of the image corpus. 


4.2 Basket inference 


Since the log we use is limited to embedded Web page 
images, and in fact only includes one fifth of the images 
fetched during a particular visit, there are considerable 
challenges involved in inferring item selection purely 
from this data. We next discuss how this inference technique 
works (illustrated at a high level in Figure 5) as 
well as its fundamental limitations.7 

We mapped out the purchasing workflow involved in 
ordering from an EvaPharmacy site, and observed that all 
purchases involve visiting four key kinds of pages in or- 
der: landing, product, shopping cart, and checkout. The 
landing page generally includes over 40 distinct embed- 
ded images. Thus, even though images are split among 
five servers, it is highly likely that multiple objects from 
each landing page are fetched via our server (each with 
a referrer field identifying the landing page from which 
it was requested).8 We observe 752,000 distinct IP addresses 
that visited and included referrer information 
during our five-day period. 


7 This general approach is similar in character to Moore and Clayton’s 
inference of phishing page visits from Webalizer logs [20]. 

8 We validated this observation using our crawled data, which 
showed that the landing pages using :8080 image hosting always used 
five distinct servers. Thus, any image server assigned to a particular 
visit is guaranteed to see the landing page load for that visit. 

When a visitor selects a particular drug from the land- 
ing page, the reply takes them to an associated product 
page. This page in turn prompts them to select the par- 
ticular dosage and quantity they wish to purchase. The 
precise construction of product pages differs between the 
set of site templates (i.e., storefront brands) used by Eva- 
Pharmacy. However, all include at least a few new im- 
ages not found on the landing page, and the most popu- 
lar template fetches five additional images. The number 
of additional images varies on a per-template basis, not 
a per-product basis within each template. Thus, for some 
templates we may have less opportunity to observe what 
product the user selects, but this does not affect our esti- 
mate of the distribution of products selected, because the 
diminished opportunity is not correlated with particular 
products. 

Next, upon selecting a product, the user is taken to the 
shopping cart page, which again includes a large number 
(often a dozen or more) of new images representing prod- 
uct recommendations. We observe 4,879 cart visits from 
3,872 distinct IP addresses. This allows us to estimate 
a product-selection conversion rate: the fraction of visi- 
tors who select an item for purchase. Based on the total 
number of visitors where we have referrer information, 
the conversion percentage on an IP basis is 0.5%.9 Of 
these, 3,089 cart additions have preceding visits to product 
pages, which allows us to infer the selected product. 
To quantify overall shopping cart addition activity, we 
compare the total number of visits to the number of visits 
to the shopping cart page. To quantify individual item 
popularity, we examine the subset of visits for which the 
customer workflow allows us to infer which specific item 
was added to the cart. 


9 For comparison, in our previous work we measured a 
visit-to-product-selection conversion rate of 2% [10]. 


Figure 5: How a user interacts with an EvaPharmacy Web site, beginning with the landing page and then proceeding to a product 
page and the shopping cart. The main Web site contains embedded images hosted on separate compromised systems. When a 
browser visits such pages, the referrer information is sent to the image hosting servers for every new image visited. 

There are three key limitations to this approach. 
First and foremost, the final page in the purchasing 
workflow—the checkout page—generally does not in- 
clude unique image content, and thus does not appear in 
our logs (even if it did, our approach could not determine 
whether checkout completed correctly). Thus, we can 
only observe that a user inserted an item into their cart, 
but not that they completed a purchase attempt. In gen- 
eral, this is only an issue to the degree that shopping cart 
abandonment correlates with variables of interest (e.g., 
drug choice). The second limitation is that pages typi- 
cally use the same image for all dosages and quantities 
on a given product page, and therefore we cannot distinguish 
these features (e.g., we cannot distinguish between 
a user selecting 120 tablets of 25mg Viagra vs. 
an order of 10 tablets, each of 100mg). Finally, we cannot 
disambiguate multiple items selected for purchase. 
When a user visits a product page followed by the shop- 
ping cart page, we can infer that they selected the associ- 
ated product. However, if the visitor then continues shop- 
ping and visits additional product pages, we cannot de- 
termine whether they added these products or simply ex- 
amined them (subsequent visits to the shopping cart page 
add few new recommended products; recommendations 
appear based on the first item in the cart). We choose 
the conservative approach and only consider the products 
that we are confident the user selected, which will cause 
us to under-represent those drugs typically purchased to- 
gether. 
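
The conservative inference rule described above might be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes that each logged image request has already been reduced to a hypothetical tuple of (visitor, timestamp, page kind, product) by classifying its referrer, and that a visitor corresponds to one anonymized IP address.

```python
from collections import defaultdict


def infer_selected_products(requests):
    """Conservatively infer at most one selected product per visitor.

    `requests` is an iterable of (visitor, timestamp, page_kind, product)
    tuples, where page_kind is "landing", "product", or "cart" and
    product is None except on product pages. Only the first cart visit
    counts, and only if a product page was seen before it.
    """
    by_visitor = defaultdict(list)
    for visitor, ts, kind, product in requests:
        by_visitor[visitor].append((ts, kind, product))

    cart_visitors = set()
    selected = {}
    for visitor, events in by_visitor.items():
        last_product = None
        for _, kind, product in sorted(events, key=lambda e: e[0]):
            if kind == "product":
                last_product = product
            elif kind == "cart":
                cart_visitors.add(visitor)
                if last_product is not None:
                    selected[visitor] = last_product
                break  # ignore anything after the first cart visit
    conversion = len(cart_visitors) / len(by_visitor) if by_visitor else 0.0
    return selected, conversion


# Invented log: visitor "a" later views a second product, which is ignored.
log = [
    ("a", 1, "landing", None), ("a", 2, "product", "drug-x"),
    ("a", 3, "cart", None), ("a", 4, "product", "drug-y"),
    ("a", 5, "cart", None),
    ("b", 1, "landing", None), ("b", 2, "cart", None),
    ("c", 1, "landing", None),
    ("d", 1, "landing", None),
]
selected, conversion = infer_selected_products(log)
print(selected)     # {'a': 'drug-x'}
print(conversion)   # 0.5
```

Visitor "b" reaches the cart without an observed product page, so it counts toward the conversion rate but contributes no product, mirroring the gap between cart visits and cart additions with an inferable product.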

Another issue is that pharmacy formularies, while 
largely similar, are not identical between programs. In 
particular, some pharmacy programs (e.g., Online Pharmacy) 
offer Schedule II drugs (e.g., Oxycodone and Vicodin). 
However, since EvaPharmacy does not sell such 
drugs, our data does not capture this category of demand. 

Finally, our dataset also has potential bias due to the 
particular means used to drive traffic to it. We found 
that 45 of the 50 top landing pages observed in the host- 
ing data also appeared in our spam-driven crawler data, 
demonstrating directly that these landing pages were ad- 
vertised through email spam. While these pages could 
also be advertised using less risky methods such as 
SEO, this seems unlikely since spam-advertised URLs 
are swiftly blacklisted [14]. Thus, we suspect (but cannot 
prove) that our data may only capture the purchasing be- 
havior for the spam-advertised pharmacies; different ad- 
vertising vectors could conceivably attract different de- 
mographics with different purchasing patterns. 

Given these limitations, we now report the results 
of two analyses: product popularity (what customers 
buy) and customer distribution (where the money comes 
from). 


4.3 Product popularity 


Our first analysis focuses on simple popularity: what in- 
dividual items users put into their shopping carts (Ta- 
ble 3a) and what broad (seller-defined) categories of 
pharmaceuticals were popular (Table 3b) during our 
measurement period. Although naturally dominated by 
the various ED and sexually-related pharmaceuticals, we 
find a surprisingly long tail; indeed, 38% of all items 
added to the cart were not in this category. We observed 
289 distinct products, including popular mass-market 
products such as Zithromax (31), Acomplia (27), Nexium 
(26), and Propecia (27); but also Cipro (11; a commonly 
prescribed antibiotic), Actos (6; a treatment for 
Type 2 diabetes), Buspar (12; anti-anxiety), Seroquel (9; 
anti-schizophrenia), Clomid (8; ovulation inducer), and 
Gleevec (1; used to treat leukemia and other cancers). 


Figure 6: The geographic distribution of those who added an 
item to their shopping cart. 





Country            Visits    Cart Additions   Added Product 
United States      517,793   3,707            0.72% 
Canada              50,234     218            0.43% 
Philippines         42,441      39            0.09% 
United Kingdom      39,087     131            0.34% 
Spain               26,968      59            0.22% 
Malaysia            26,661      31            0.12% 
France              18,541      37            0.20% 
Germany             15,726      56            0.36% 
Australia           15,101      86            0.57% 
India               10,835      17            0.16% 
China                8,924      30            0.34% 
Netherlands          8,363      21            0.25% 
Saudi Arabia         8,266      36            0.44% 
Mexico               7,775      17            0.22% 
Singapore            7,586      17            0.22% 


Table 2: The top 15 countries and the percentage of visitors 
who added an item to their shopping cart. 
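
The final column of Table 2 is the ratio of cart additions to visits. As a quick check, the short sketch below recomputes it for the first four rows; the dictionary layout and the function name are illustrative choices, while the counts are taken from the table.

```python
# Visits and cart additions for the first four rows of Table 2.
table2 = {
    "United States": (517_793, 3_707),
    "Canada": (50_234, 218),
    "Philippines": (42_441, 39),
    "United Kingdom": (39_087, 131),
}


def conversion_pct(visits, cart_additions):
    """Percentage of visits that led to a cart addition."""
    return 100.0 * cart_additions / visits


for country, (visits, additions) in table2.items():
    print(f"{country:15s} {conversion_pct(visits, additions):.2f}%")
# United States   0.72%
# Canada          0.43%
# Philippines     0.09%
# United Kingdom  0.34%
```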


This in turn explains why such online pharmacies 
maintain a comprehensive inventory: not only does a full 
formulary lend legitimacy, but it also represents a signif- 
icant source of potential revenue. 

We also comprehensively crawled an EvaPharmacy 
site for pricing data and calculated the minimum esti- 
mated revenue per purchase (also shown for the top 18 
products in Table 3a). Combining this data with our mea- 
surement of item popularity, we calculate a minimum 
weighted-average item cost of $76 plus $15 for shipping 
and handling. This weighted average assumes visitors al- 
ways select the minimum-priced item for any given pur- 
chase, and that the final purchases have the same distri- 
bution as for items added to the user’s shopping cart. 
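
The weighted average described here can be written out as a small sketch. The product names, counts, and prices below are invented for illustration and are not the values behind the $76 figure; the function name is likewise hypothetical.

```python
def weighted_min_cost(items):
    """Popularity-weighted average of each product's minimum price.

    `items` maps product name -> (cart_additions, minimum_price).
    """
    total_additions = sum(count for count, _ in items.values())
    total_value = sum(count * price for count, price in items.values())
    return total_value / total_additions


# Invented counts and prices, for illustration only.
items = {
    "product-a": (60, 50.00),
    "product-b": (30, 100.00),
    "product-c": (10, 200.00),
}
print(weighted_min_cost(items))  # 80.0
```

Because each product contributes its cheapest listed price, the result is a lower bound on revenue per item under the stated assumption that completed purchases follow the same distribution as cart additions.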


4.4 Customer distribution 


We next examine the geographic component of the Eva- 
Pharmacy customer base. Figure 6 shows the geolocated 
origin for all shopping cart additions. We observe that 
EvaPharmacy has a vast advertising reach, producing site 
visits from 229 distinct countries or territories. However, 
this reach is not necessarily all that useful: the population 
actively engaging with EvaPharmacy sites and placing 
orders is considerably less diverse than the superset sim- 
ply visiting (perhaps inadvertently or due to curiosity). 
For example, the Philippines constitutes 4% of the vis- 
itors, but only 1% of the additions to the shopping cart. 
Overall, countries other than the U.S., Canada, and West- 
ern Europe generate 29% of the visitors but only 13% of 
the items added to the shopping cart. Conversely, the vast 
majority of shopping cart insertions originate from the 
U.S. and Canada (80%) or Europe (6%), reinforcing the 
widely held belief that spam-advertised pharmaceuticals 
are ultimately funded with Western Dollars and Euros. 

The United States dominates both visits (54%) and 
cart additions (76%), and moreover has the highest rate 
of conversion between visit and shopping cart insertion 
(0.72%). Table 2 illustrates this, listing the activity
from the countries originating the most visits. This
observation reinforces the conclusion that non-Western 
audiences offer ineffective targets for such advertising. 
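
As a rough illustration of how the conversion figures in Table 2 follow from the raw counts, the sketch below recomputes the visit-to-cart rate for a few rows. The dictionary layout and the function name `conversion_rate` are illustrative choices, not part of the original analysis.

```python
# Visits and cart additions copied from a few rows of Table 2.
table2 = {
    "United States": (517_793, 3_707),
    "Canada": (50_234, 218),
    "Philippines": (42_441, 39),
    "United Kingdom": (39_087, 131),
}


def conversion_rate(visits: int, additions: int) -> float:
    """Fraction of visits that resulted in a shopping cart addition."""
    return additions / visits


rates = {country: conversion_rate(v, a) for country, (v, a) in table2.items()}

# Print countries from highest to lowest conversion rate.
for country, rate in sorted(rates.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{country:15s} {rate:.2%}")
```

Sorting by rate rather than by visit count makes the gap visible: the Philippines ranks third by visits in Table 2 but converts at roughly one eighth of the U.S. rate.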

Finally, we also notice significant differences be- 
tween the drug selection habits of Americans com- 
pared to customers from Canada and Western Europe. 
In particular, we divide the EvaPharmacy formulary 
into two broad categories: lifestyle drugs (defined as 
drugs commonly used recreationally, including “male- 
enhancement” items plus Human Growth Hormone, 
Soma and Tramadol) and non-lifestyle (all others, including
birth control pills). We find that while U.S. customers
select non-lifestyle items 33% of the time, Canadian and
Western European customer selections concentrate far more
in the lifestyle category: only 8% of the items they place
in a shopping cart are non-lifestyle items.
We surmise that this discrepancy may arise due to differ- 
ences in health care regimes; drugs easily justified to a 
physician may be fully covered under state health plans 
in Canada and Western Europe, leaving an external mar- 
ket only for lifestyle products. Conversely, a subset of 
uninsured or under-insured customers in the U.S. may 
view spam-advertised, no-prescription-required pharma- 
cies as a competitive market for meeting their medical 
needs. To further underscore this point, we observe that 
85% of all non-lifestyle drugs are selected by U.S. visi- 
tors. 


5 Revenue estimation 


Combining the results from estimates on the order rate 
per program and estimates of the shopping cart makeup, 
we now estimate total revenue on a per-program basis. 


5.1 Average price per order 


(a)
Product                      Quantity   Min order
Generic Viagra               568        $78.80
Cialis                       286        $78.00
Cialis/Viagra Combo Pack     172        $74.95
Viagra Super Active+         121        $134.80
Female (pink) Viagra         119        $44.00
Human Growth Hormone         104        $83.95
Soma (Carisoprodol)          99         $94.80
Viagra Professional          87         $139.80
Levitra                      83         $100.80
Viagra Super Force           81         $88.80
Cialis Super Active+         72         $172.80
Amoxicillin                  47         $35.40
Lipitor                      38         $14.40
Ultram                       38         $45.60
Tramadol                     36         $82.80
Prozac                       35         $19.50
Cialis Professional          33         $176.00
Retin A                      31         $47.85

(b)
Category                           Quantity
Men’s Health                       1760
Pain Relief                        232
Women’s Health                     183
General Health                     135
Antibiotics                        134
Antidepressants                    95
Weight Loss                        92
Allergy & Asthma                   85
Heart & Blood Pressure             72
Skin Care                          54
Stomach                            41
Mental Health & Epilepsy           33
Anxiety & Sleep Aids               33
Diabetes                           22
Smoking Cessation                  22
Vitamins and Herbal Supplements    18
Eye Care                           15
Anti-Viral                         14

Table 3: Table (a) shows the top 18 product items added to visitor shopping carts (representing 66% of all items added). Table (b)
shows the top 18 seller-defined product categories (representing 99% of all items).


The revenue model underlying our analysis is simple: we
multiply the estimated order rate by the average price per
order to arrive at a total revenue figure over a given unit
of time. However, we do not know, on a per-program basis,
the actual average purchase price. Thus, we explore
three different approximations, all of which we believe
are conservative.

First, for on-line pharmacies we use the static value of 
roughly $100 as reported in our previous “Spamalytics” 
study [10]. However, this study only considered one par- 
ticular site, covered only 28 customers, and was unable 
to handle more than a single item placed in a cart (i.e., 
it could not capture information about customers buying 
multiple items). 

We also consider a second approximation based on the 
minimum priced item (including shipping) on the site for 
each program under study. Since sites can have enormous 
catalogs, we restrict the set of items under considera- 
tion as follows. For pharmacy sites, we consider the top 
18 most popular items as determined by the analysis of 
EvaPharmacy in § 4 (these top 18 items constituted 66% 
of order volume in our analysis). For each of these items 
present in the target pharmacy, we find the minimum- 
priced instance (i.e., lowest dosage and quantity) and use 
the overall minimum as our per-order price. For small 
deviations between pharmacy formularies (e.g., differ- 
ent Viagra store-brand variants) we simply substitute one 
item for the other. We repeat this same process for soft- 
ware, but since we do not have a reference set of most 
popular items for this market, we simply use the declared
“bestsellers” at each site (16 at Royal Software, 36 at
SoftSales, and 76 at EuroSoft), again using the minimum
priced item to represent the average price per order.

Finally, we calculate a “basket-weighted average”
price using measured popularity data. For pharmacies we
again consider the 18 most popular EvaPharmacy items
and extract the overlap set with other pharmacies. Using
the relative frequency of elements in this intersection,
we calculate a popularity vector that we then use
to weight the minimum item price; we use the sum of
these weighted prices as the average price per order.
Intuitively, this approach tries to accommodate the fact
that products have non-uniform popularity, while still
using the conservative assumption that users order the
minimum dosage and quantity for each item. Note that we
implicitly assume that the distribution of drug popularity
holds roughly the same between online pharmacies.¹⁰
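
The basket-weighted average described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name and data layout are hypothetical, the popularity counts are taken from Table 3a, and the prices for the target pharmacy are placeholders (here simply reusing two Table 3a minimum prices) rather than measured values.

```python
def basket_weighted_average(popularity: dict, min_price: dict) -> float:
    """Popularity-weighted mean of minimum item prices.

    popularity: reference cart-addition counts per item (e.g., from EvaPharmacy).
    min_price:  minimum price per item at the target pharmacy.
    Only items present in both (the overlap set) contribute.
    """
    overlap = [item for item in popularity if item in min_price]
    if not overlap:
        raise ValueError("no overlapping items between the two catalogs")
    total = sum(popularity[item] for item in overlap)
    # Normalize counts into a popularity vector, then weight each minimum price.
    return sum(popularity[item] / total * min_price[item] for item in overlap)


# Reference popularity (counts from Table 3a).
popularity = {"Generic Viagra": 568, "Cialis": 286, "Levitra": 83}

# Hypothetical target pharmacy that carries only two of the reference items.
target_min_price = {"Generic Viagra": 78.80, "Cialis": 78.00}

average = basket_weighted_average(popularity, target_min_price)
print(f"basket-weighted average: ${average:.2f}")
```

Renormalizing over the overlap set keeps the weights summing to one even when the target pharmacy lacks some reference items, which is one plausible reading of "relative frequency of elements in this intersection".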

We repeated this analysis, as before, with site-declared 
best-selling software packages. To gauge relative popu- 
larity, we searched a large BitTorrent metasearch engine 
(isohunt.com), which indexes 541 sites tracking over 
6.5 million torrents. We assigned a popularity to each 
software item in proportion to the sum of the seeders and 
leechers on all torrents matching a given product name. 
We then weighted the total prices (inclusive of any han- 
dling charge) by this popularity metric to arrive at an es- 
timate of the average order price. 


¹⁰One data point supporting this view is Rx-Promotion’s rank-
ordered list of best selling drugs. The ten most popular items sold by
both pharmacies are virtually the same and ranked in the same order.




                                   Spamalytics                Min product price          Basket-weighted average
Affiliate Program   orders/month   single order   rev/month   single order   rev/month   single order   rev/month
33drugs             9,862          $100           $980,000    $45.00         $440,000    $57.25         $560,000
4RX                 8,001          $100           $800,000    $34.50         $280,000    $95.00         $760,000
EuroSoft            22,776         N/A            N/A         $26.50         $600,000    $84.50         $1,900,000
EvaPharmacy         26,962         $100           $2,700,000  $50.50         $1,300,000  $90.00         $2,400,000
GlavMed             17,933         $100           $1,800,000  $54.00         $970,000    $57.00         $1,000,000
Online Pharmacy     5,856          $100           $590,000    $37.00         $220,000    $58.00         $340,000
Pharmacy Express    7,933          $100           $790,000    $51.00         $410,000    $58.75         $460,000
Royal Software      13,483         N/A            N/A         $55.25         $750,000    $133.75        $1,800,000
Rx-Promotion        6,924          $100           $690,000    $45.00         $310,000    $57.25         $400,000
SoftSales           1,491          N/A            N/A         $20.00         $30,000     $134.50        $200,000


Table 4: Estimated monthly order volume, average purchase price, and monthly revenue (in dollars) per affiliate program using
three different per-order price approximations.


5.2 Revenue 


Finally, to place a rough estimate on revenue, we multi- 
ply the 2011 order volume measurements shown in Fig- 
ure 4 against each of the previously mentioned approxi- 
mations, summarized in Table 4. In general, the approxi- 
mation from our prior “Spamalytics” study is the largest, 
followed by basket-weighted average and then minimum 
product price. However, for pharmaceutical programs 
the difference between product prices is not large, and 
thus the minimum and basket-weighted estimates all lie 
within 2X of one another. Software programs see much 
more variation in price, and hence the difference between 
the minimum and basket-weighted revenue estimates can 
be substantial. 
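
The revenue model is a single multiplication, so the spread between approximations can be checked directly. The sketch below uses three rows of Table 4; the variable names are illustrative, and the computed figures are unrounded, whereas Table 4 reports them to about two significant digits.

```python
# (orders/month, min product price, basket-weighted price) from Table 4.
programs = {
    "EvaPharmacy": (26_962, 50.50, 90.00),
    "GlavMed": (17_933, 54.00, 57.00),
    "SoftSales": (1_491, 20.00, 134.50),
}


def monthly_revenue(orders_per_month: int, price_per_order: float) -> float:
    """Estimated revenue: order volume times average price per order."""
    return orders_per_month * price_per_order


ratios = {}
for name, (orders, min_price, basket_price) in programs.items():
    low = monthly_revenue(orders, min_price)
    high = monthly_revenue(orders, basket_price)
    # Ratio of the basket-weighted estimate to the minimum-price estimate.
    ratios[name] = high / low
    print(f"{name:12s} min ${low:>12,.0f}  basket ${high:>12,.0f}  ratio {ratios[name]:.2f}x")
```

For the two pharmacy programs the ratio stays below two, while for SoftSales it is several times larger, matching the observation about software price variation.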

Using the basket-weighted approximation, we find
that both GlavMed and EvaPharmacy produce revenues
in excess of $1M per month, with all but two programs
over $400K. Surprisingly, software sales also produce
high revenue, less due to high prices than to high order
volumes. Validating how closely order volumes track
successfully completed orders for this market niche
remains future work.


5.3 External consistency


While we put considerable care into producing these es- 
timates, a number of biases remain unavoidable. First, 
while our order volume data has internal consistency 
(and consistency with order number implementations in 
common shopping cart software), we could not capture 
the impact of order declines. Thus, we have a somewhat 
optimistic revenue estimate, since surely some fraction 
of orders will not complete. 

On the other hand, our estimates of average order rev- 
enue are themselves conservative in several key ways. 
First, they assume that all purchasers select only a sin- 
gle item. Second, they assume that when purchasing an 
item, all users select the minimum dosage and quantity. 
Finally, for pharmaceuticals we need to keep in mind
that EvaPharmacy does not carry “harder” drugs found 
at other sites, such as Schedule I opiates. We have found 
anecdotal evidence that these drugs are highly popular 
at such sites, but our methodology does not offer any 
means to consider their impact. Such items are also typi- 
cally more expensive than other drugs (e.g., the cheapest 
Hydrocodone order possible at one popular pharmacy is 
$186 plus shipping). Thus, this other factor will cause us 
to underestimate the true revenue per order. 

Our intuition is that such factors are modest, and 
our estimates capture—within perhaps a small constant 
factor—the true level of financial activity within each 
enterprise. However, absent ground truth data for pro- 
gram revenues, it is not generally possible to validate our 
model and hence verify that our measurements actually 
capture reality. In general, this kind of validation is rarely 
possible since the actors involved are not public compa- 
nies and do not make revenue statements available. 

Due to an unusual situation, however, we were able
to acquire such information for one program,
Rx-Promotion. In particular, a third party made public a
variety of information, including multiple months of
accounting data, for Rx-Promotion’s payment processor.¹¹
While we cannot validate the provenance of this data, 
its volume and specificity make complete fabrication un- 
likely. In addition, given that our research covers only a 
small subset of this data, it seems further unlikely that 
any fabrication would closely match our own indepen- 
dent measurements. 

Unfortunately, we do not have payment ledgers
precisely covering our 2011 measurement period. Instead,
we compare against a similar period six months earlier
for which we do have ground truth documentation,
27 consecutive days from the end of Spring, 2010. These
two periods are comparable because during both times
Rx-Promotion had significant difficulty processing orders
on “controlled” drugs (indeed, during the 2011 period
such drugs had been removed from the standard
formulary on Rx-Promotion affiliates).¹²


¹¹While our legal advisers believe that the prior public disclosure of
this data allows its use in a research context, we chose not to
unnecessarily antagonize the payment services provider by naming them here.

Based on this data, we find that between May 31 and
June 26, 2010, Rx-Promotion’s turnover via electronic
payments was $609K.¹³ Using our estimate of 385 orders
per day in 2010 (see § 3), this is consistent with an
average revenue per order of $58, very similar to our
basket-weighted average order price estimate of $57. While we
suspect that both estimates are likely off (with the num- 
ber of true June 2010 orders likely less due to declines, 
and January 2011 price-per-order likely higher due to 
conservatism in our approximation), they are sufficiently 
close to one another to support our claim that this ap- 
proach can provide a rough, but well-founded estimate 
(i.e., within a small constant factor) of program revenue. 
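
The consistency check above reduces to one division, which the following sketch reproduces from the figures quoted in the text. The variable names are illustrative; the 27-day window is May 31 through June 26, 2010, inclusive.

```python
turnover_usd = 609_000   # reported turnover via electronic payments
orders_per_day = 385     # estimated 2010 order rate (see § 3)
days = 27                # May 31 through June 26, 2010, inclusive

# Average revenue per order implied by the ledger and the order-rate estimate.
implied_price = turnover_usd / (orders_per_day * days)

# Basket-weighted estimate for Rx-Promotion, rounded as quoted in the text.
basket_estimate = 57.00
relative_gap = abs(implied_price - basket_estimate) / basket_estimate

print(f"implied revenue per order: ${implied_price:.2f}")
print(f"relative gap to basket-weighted estimate: {relative_gap:.1%}")
```

The implied figure lands between $58 and $59 per order, within a few percent of the basket-weighted estimate.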


6 Conclusion 


When asked why he robbed banks, Willie Sutton famously
responded, “Because that’s where the money is.” The same
premise is frequently used to explain the
plethora of unwanted spam that fills our inboxes, pol- 
lutes our search results and infests our social networks— 
spammers spam because they can make money at it. 
However, a key question has long been how much money, 
and from whom? In this paper we provide what we be- 
lieve represents the most comprehensive attempt to an- 
swer these questions to date. We have developed new in- 
ference techniques: one to estimate the rate of new orders 
received by the very enterprises whose revenue drives 
spam, and the other to characterize the products and cus- 
tomers who provide that same revenue. We provide quan- 
titative evidence showing that spam is ultimately sup- 
ported by Western purchases, with a particularly central 
role played by U.S. customers. We also provide the first 
sense of market size, with well over 100,000 monthly 
orders placed in our dataset alone. Finally, we provide 
rough but well-founded estimates of per-program rev- 
enue. Our results suggest that while the spam-advertised 
pharmacy market is substantial, with annual revenue in 
the many tens of millions of dollars, it has nowhere near 
the size claimed by some, and indeed falls vastly short of 
the annual expenditures on technical anti-spam solutions. 


¹²During periods when such drugs were sold en masse, the overall
Rx-Promotion revenue was frequently doubled.

¹³Interestingly, this data also provides useful information about
refunds and chargebacks (together about 10% of revenue) as well as
processing fees (roughly 8.5%). Thus, the gross revenue delivered to
Rx-Promotion in June 2010 was likely closer to $489K. Finally, since
roughly 40% of successful order income is paid to affiliates on a
commission basis, that leaves only $270K (44% of gross) for fulfillment,
administrative costs, and profit.
ABSTRACT


This paper presents the first wireless pairing protocol 
that works in-band, with no pre-shared keys, and protects 
against MITM attacks. The main innovation is a new key 
exchange message constructed in a manner that ensures 
an adversary can neither hide the fact that a message was 
transmitted, nor alter its payload without being detected. 
Thus, any attempt by an adversary to interfere with the 
key exchange translates into the pairing devices detect- 
ing either invalid pairing messages or an unacceptable 
increase in the number of such messages. We analytically 
prove that our design is secure against MITM attacks, 
and show that our protocol is practical by implementing a 
prototype using off-the-shelf 802.11 cards. An evaluation 
of our protocol on two busy wireless networks (MIT’s 
campus network and a reproduction of the SIGCOMM 
2010 network using traces) shows that it can effectively 
implement key exchange in a real-world environment. 


1 INTRODUCTION 


Recent trends in the security of home WiFi networks are 
driven by two phenomena: ordinary users often strug- 
gle with the security setup of their home networks [14], 
and, as a result, some of them end up skipping security 
activation [19, 26]. Simultaneously, there is a prolifera- 
tion of WiFi gadgets and sensors that do not support an 
interface for entering a key. These include WiFi sound 
systems, medical sensors, USB keys, light and tempera- 
ture sensors, motion detectors and surveillance sensors, 
home appliances, and game consoles. Even new models 
of these devices are unlikely to support a keypad because 
of limitations on their form factor, style, cost, or func- 
tionality. Responding to these two requirements—easing 
security setup for home users, and securing devices that 
do not have an interface for entering a key—the WiFi 
Alliance has introduced the Push Button Configuration 
(PBC) mechanism [26]. To establish a secure connec- 
tion between two WiFi devices, the user pushes a button 
on each device, and the devices broadcast their Diffie- 
Hellman public keys [7], which they then use to protect 
all future communication. PBC is a mandatory part of 
the new WiFi Protected Setup certification program [27]. 
It is already adopted by the major WiFi manufacturers 
(e.g., Cisco, NetGear, HP, Microsoft, Sony) and imple- 
mented in about 2,000 new products from 117 different 
companies [25]. 

Unfortunately, the PBC approach taken by the WiFi 
Alliance does not fully address WiFi security. Diffie-Hellman’s
key-exchange protocol [7] protects against only
passive adversaries that snoop on the wireless medium to 
obtain key exchange messages. Since the key exchange 
messages are not authenticated in any way, the protocol is 
vulnerable to an active man-in-the-middle (MITM) attack. 
That is, an adversary can impersonate each device to 
the other, convincing both devices to establish a secure 
connection via the adversary. With WiFi increasingly 
used in medical sensors that transmit a patient’s vital 
signals [11] and surveillance sensors that protect one’s 
home [16, 21], there is a concern that, being vulnerable 
to MITM attacks, PBC may give users a false sense of 
security [15, 26]. 

One may wonder why the WiFi Alliance did not 
adopt a user-friendly solution that also protects against 
MITM attacks. We believe the reason is that exist- 
ing user-friendly solutions to MITM attacks require de- 
vices to support an out-of-band communication chan- 
nel [6, 10, 17, 18, 20, 22]. For example, devices can 
exchange keys over a visual channel between an LCD and 
a camera [18], an audio channel [10], an infrared chan- 
nel [2], a dedicated wireless channel allocated exclusively 
for key exchange [6], etc. Given the cost, size, and capa- 
bility constraints imposed on many WiFi products, it is 
difficult for the industry to adopt a solution that requires 
an out-of-band communication channel. 

This paper presents tamper-evident pairing (TEP), a 
novel protocol that provides simple, secure WiFi pairing 
and protects against MITM attacks without an out-of-band 
channel. TEP can also be incorporated into PBC devices 
and existing WiFi chipsets without hardware changes. 

TEP’s main challenge in avoiding MITM attacks comes 
from operating on a shared wireless network, where an 
adversary can mask an attack behind cross traffic, making 
it difficult to distinguish an adversary’s actions from legit- 
imate traffic patterns. To understand this, consider a key 
exchange between Alice and Bob, where Bob sends his 
Diffie-Hellman public key to Alice. Lucifer, the adversary, 
could tamper with this key exchange as follows: 


• Collision: Lucifer can jam Bob’s message, causing a
  collision, which would not look out-of-the-ordinary on
  a busy wireless network. The collision prevents Alice
  from decoding Bob’s message. Lucifer can now send
  his own message to Alice, in lieu of Bob’s message,
  perhaps with the help of a directional antenna so that
  Bob does not notice the attack.

• Capture effect: Lucifer can transmit simultaneously
  with Bob, but at a significantly higher power, to produce
  a capture effect at Alice [24]. In this case, Alice will
  decode Lucifer’s message, in which he impersonates
  Bob, despite Bob’s concurrent transmission. Bob will
  not know about Lucifer’s transmission.

• Timing control: Lucifer can try to impersonate Alice
  by continuously occupying the wireless medium after
  Bob sends out his key, so that Lucifer can send out a
  message pretending to be Alice, but Alice does not get
  a chance to send her legitimate key.


Figure 1: The format of a tamper-evident announcement (TEA).
[Figure labels: payload packet, CTS_to_SELF, ON-OFF slots (110101...01), time axis.]


To address these attacks in TEP, we introduce a tamper-evident
announcement (TEA) primitive. The key characteristic
of a TEA message is that an attacker can neither
hide a TEA transmission from other nodes within radio 
range, nor can it modify the content of the TEA without 
being detected. Thus, a TEA provides stronger guarantees 
than payload integrity because it also protects the fact that 
a message was transmitted in the first place. 

Fig. 1 shows the structure of a TEA. First, to ensure that
Lucifer cannot mask Bob’s TEA message by introducing 
a collision, the TEA starts with an exceptionally long 
packet. Since standard WiFi collisions are significantly 
shorter, Alice needs to detect only exceptionally long 
collisions (i.e., exceptionally long bursts of energy) as 
potential attacks on the key exchange process. 

Second, to ensure that Lucifer cannot alter the pay- 
load of Bob’s TEA by transmitting his own message at 
a high power to create a capture effect, we force any 
TEA message to include silence periods. As shown in 
Fig. 1, the payload of the TEA message is followed by a 
sequence of short equal-size packets, called slots, where 
the transmission of a packet is interpreted as a “1” bit, 
and an idle medium is interpreted as a “0” bit. The bit 
sequence produced by the slots must match a hash of the 
TEA payload. If Lucifer overwrites Bob’s message with 
his own, he must transmit slots corresponding to a hash 
of his message, including staying silent during any zero 
hash bits. However, since the hash of Lucifer’s message 
differs from that of Bob’s message, Bob’s message will 
show up on the medium during Lucifer’s “0” slots. Alice 
will detect a mismatch between the slots and the message 
hash and reject Lucifer’s message. 
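
The ON-OFF slot check can be illustrated with a small simulation. This is only a sketch under stated assumptions: SHA-256 stands in for the unspecified hash, the slot encoding is a direct bit-for-bit mapping, and the medium is modeled as a logical OR of all transmitters (a slot looks occupied if anyone transmits in it). The function names are hypothetical and not taken from the TEP implementation.

```python
import hashlib


def slot_bits(payload: bytes) -> list[int]:
    """ON-OFF pattern for a payload: one slot per bit of its hash (SHA-256 assumed)."""
    digest = hashlib.sha256(payload).digest()
    return [(byte >> (7 - i)) & 1 for byte in digest for i in range(8)]


def observed_slots(*transmissions: list[int]) -> list[int]:
    """What a receiver senses: a slot is occupied if any sender transmits in it."""
    return [int(any(bits)) for bits in zip(*transmissions)]


def accept(decoded_payload: bytes, slots: list[int]) -> bool:
    """Accept only if the sensed slots match the hash of the decoded payload."""
    return slots == slot_bits(decoded_payload)


bob = b"bob-public-key"
lucifer = b"lucifer-public-key"

# Honest case: only Bob transmits, so the slots match his payload's hash.
honest = observed_slots(slot_bits(bob))

# Attack: Lucifer overpowers the payload, but Bob's slots still reach Alice,
# so energy shows up in some of Lucifer's "0" slots.
attacked = observed_slots(slot_bits(bob), slot_bits(lucifer))

print("honest accepted:  ", accept(bob, honest))
print("attacked accepted:", accept(lucifer, attacked))
```

The asymmetry is the point of the design: an attacker can add energy to turn a "0" slot into a "1", but cannot remove Bob's energy to turn a "1" into a "0".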

Third, to ensure that legitimate nodes do not mess up 
the timing of Alice and Bob’s key exchange, the TEA 
message includes a CTS-to-SELF, as shown in Fig. 1. 
CTS-to-SELF is an 802.11 message that requires honest 
nodes to refrain from transmitting for a time period specified
in the packet. TEP leverages this message for two
goals. First, it uses it to reserve the medium for the dura- 
tion of the TEA slots to ensure that legacy 802.11 nodes, 
unaware of the structure of a TEA message, do not sense 
the medium as idle and transmit during a TEA’s silent 
slots. Second, TEP also uses CTS-to-SELF to reserve the 
medium for a short period after the TEA slots, to enable 
Alice to send her key to Bob within the interval allowed 
by PBC. Once Alice starts her transmission, the medium 
will be occupied, and honest 802.11 nodes will abstain 
from transmitting concurrently. If Lucifer transmits dur- 
ing the reserved time frame, Alice will still transmit her 
TEA message, and cause a collision, and hence an invalid 
TEA message that Bob can detect. 

We build on TEA to develop the TEP pairing protocol. 
TEP exploits the fact that any attempts to alter or hide 
a TEA can be detected. Thus, given a pairing window, 
any attempt by an adversary to interfere with the pairing 
exchange translates into either an increase in the number 
of TEA messages or some invalid TEA messages. This 
allows the pairing devices to detect the attack and indicate 
to the user that pairing has failed and that she should 
retry. The cost of such a mechanism is that the user has to 
wait for a pre-determined duration of the pairing window. 
In §5.4, we describe how one may eliminate this wait by
having a user push the button on a device a second time. 

This paper formalizes the above ideas to address possi- 
ble interactions between the pairing devices, adversaries, 
and other users of the medium, and formally proves that 
the resulting protocol is secure against MITM attacks. 
Further, we build a prototype of TEP as an extension to 
the Ath5k driver [1], and evaluate it using off-the-shelf
802.11 Atheros chipsets. Our findings are as follows: 


• TEP can be accurately realized using existing OS and
  802.11 hardware. Specifically, our prototype sender
  can schedule silent and occupied slots at a resolution
  of 40 µs, and its 95th percentile scheduling error is as
  low as 1.65 µs. Our prototype receiver can sense the
  medium’s occupancy over periods as small as 20 µs and
  can distinguish occupied slots (“1” bits) from silent
  slots (“0” bits) with a zero error rate.

• Results from running the protocol on our campus network
  and applying the traces from the network during
  the SIGCOMM 2010 conference show that TEP never
  confuses honest 802.11 traffic for an attack. Furthermore,
  though our implementation is for 802.11, it can
  coexist with nearby Bluetooth devices which do not
  respect TEP silent slots. In this case, TEP can still
  perform a key exchange using 1.4 attempts, on average.


Contributions: This paper presents, to our knowledge, 
the first wireless pairing protocol that defeats MITM at- 
tacks without any key distribution or out-of-band channels. 
It does so by introducing TEA, a new key exchange message
constructed in a manner that ensures an adversary
can neither hide the fact that a message was transmitted, 
nor alter its payload without being detected. Our proto- 
col is prototyped using off-the-shelf 802.11 devices and 
evaluated in production WiFi networks. 


2 RELATED WORK 


There has been a lot of interest in user-friendly secure 
wireless pairing, which has led to a number of innovative 
solutions [2, 6, 10, 17, 18, 20, 22]. TEP builds on this 
foundational work. However, TEP is the first to provide a 
secure pairing scheme that defeats MITM attacks without 
out-of-band channels, or key distribution or verification. 

Closest to TEP is the work on integrity codes [5], which 
protects the integrity of a message’s payload by inserting 
a particular pattern of ON-OFF slots. Integrity codes, 
however, assume a dedicated out-of-band wireless chan- 
nel. In contrast, on shared channels, honest nodes may 
disturb the ON-OFF pattern by acquiring the medium 
during the OFF slots. Further, the attacker can hide the 
fact that a message was transmitted altogether, by using 
collisions or a capture effect. We build on integrity codes, 
but introduce TEA, a new communication primitive that 
not only protects payload integrity but also ensures that an 
attacker cannot hide that a message was transmitted. We 
further construct TEP by integrating TEA with the 802.11 
standard, the PBC protocol, and the existing OS network 
stack. Finally, we implement TEP on off-the-shelf WiFi 
devices and evaluate it in operational networks. 

TEP is also related to work on secure pairing, which 
traditionally required the user to either enter passwords 
or PINs [3, 4, 12], or distribute public keys (e.g., STS [8],
Radius in 802.11i [13], or any other public key
infrastructure). These solutions are appropriate for enterprise
networks and for a certain class of home users who are 
comfortable with security setup. However, the need to 
ease security setup for non-technical home users has moti- 
vated multiple researchers to propose alternative solutions 
for secure pairing. Most previous solutions use a trusted 
out-of-band communication channel for key exchange. 
The simplest channel is a physical wired connection be- 
tween the two devices. Other variants of out-of-band 
channels include the use of a display and a camera [18], 
an audio-based channel [10], an infra-red channel [2], 
a tactile channel [22], or an accelerometer-based chan- 
nel [17]. While these proposals protect against MITM 
attacks, many devices cannot incorporate such channels 
due to size, power, or cost limitations. In contrast, TEP 
eases the security setup for home users and defeats MITM 
attacks, without any out-of-band channel. 

Finally, multiple user studies [14, 19, 26] have emphasized
the difficulty in pairing devices for ordinary users.
Our work is motivated by these studies. TEP requires the
user to just push a button on each device, exactly as in
PBC, and does not require any additional user involvement
in key generation or verification.


Figure 2: A timeline depicting the operation of Push Button
Configuration (PBC) between an enrollee and a registrar.


3 PBC AND 802.11 BACKGROUND


3.1 Push Button Configuration


The WiFi Alliance introduced the Push Button Configuration
(PBC) mechanism to ease the security setup process
for ordinary users, and to deal with devices that do not 
have an interface to enter passwords or PINs. In this 
section, we provide an overview of how PBC works. 

Consider a home user who wants to associate an en- 
rollee (PBC’s term for the new device, e.g., a gaming 
console) with a registrar (PBC’s term for, effectively, 
the access point). The user first pushes a button on the 
enrollee and then, within 120 seconds (called the walk 
time), pushes the button on the registrar. Once the but- 
tons are pushed on the two devices, the devices perform a 
Diffie-Hellman key exchange to establish a secret key. 

As shown in Fig. 2, once the button is pushed on the en- 
rollee, it periodically sends probes [26] requesting replies 
from registrars whose PBC button has been pressed. Once 
the enrollee receives a reply, it makes a note of the reply 
and continues to scan all the 802.11 channels for addi- 
tional replies. If the enrollee receives replies from more 
than one registrar, across all 802.11 channels, it raises a 
session overlap error, indicating that the user should try 
again later. On the other hand, if it receives a reply from 
only one registrar, it proceeds with the registration proto- 
col, using the Diffie-Hellman key from that one reply. 

A registrar, for its part, stays on its dedicated channel, 
and replies to probe requests only if the user has pushed 
its PBC button. Once the button is pushed, the registrar 
replies to PBC requests from potential enrollees. To detect 
conflicts, the registrar checks for requests in the last 120 
seconds. If there are requests from more than one enrollee, 
the registrar signals a session overlap error and refuses 
to perform the PBC registration protocol, requiring the 
user to retry. If there was only one enrollee request, the 
registrar proceeds with the registration protocol using the 
Diffie-Hellman public key from that one request. 
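
The registrar-side conflict check described above can be sketched as a small decision function. This is an illustrative model, not code from the PBC specification: the function name, the event format, and the "no_peer" outcome for an empty window are assumptions. The enrollee applies the same one-peer rule to the replies it collects across channels.

```python
WALK_TIME_S = 120  # PBC walk time, in seconds, as given in the text


def pbc_decision(requests, now):
    """Decide how a registrar reacts once its button has been pushed.

    requests: iterable of (timestamp_s, enrollee_id) for PBC requests heard.
    now:      current time in seconds.
    """
    # Distinct enrollees heard within the last WALK_TIME_S seconds.
    recent = {peer for ts, peer in requests if 0 <= now - ts <= WALK_TIME_S}
    if len(recent) > 1:
        return "session_overlap"  # more than one enrollee: refuse, user retries
    if len(recent) == 1:
        return "proceed"          # exactly one enrollee: run the registration
    return "no_peer"              # assumption: nothing heard in the window


print(pbc_decision([(10, "console")], now=100))
print(pbc_decision([(10, "console"), (50, "unknown-device")], now=100))
```

Counting distinct enrollees rather than raw requests matters because a single enrollee sends probes periodically, so repeated requests from one device should not trigger an overlap error.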




While PBC’s use of Diffie-Hellman protects the de- 
vices from eavesdropper attacks, an active adversary can 
hide or change any of the messages, by resorting to colli- 
sions, capture effect attacks, or hogging the medium and 
delaying these messages. This allows an adversary to gain 
access to the user’s registrar (e.g., their home network), 
the enrollee device, or to intercept and alter any future 
messages between the enrollee and registrar. Defending 
against such adversaries requires a system that is robust 
to MITM attacks, which is the main contribution of TEP. 


3.2 802.11 


Since our protocol involves low-level details of the 802.11 
standard, we summarize the relevant aspects of 802.11 
in this section. 802.11 requires nodes to sense the wire- 
less medium for energy, and transmit only in its absence. 
802.11 nodes can transmit using a range of bit rates, with 
the minimum bit rate of 1 Mbps. Coupling this with the 
fact that the maximum packet size used by higher layers 
is typically 1500 bytes, an honest node can occupy the 
channel for a maximum of 12 ms. 802.11 requires back- 
to-back packets to be separated by an interval called the 
DCF Inter-Frame Spacing (DIFS), whose value can be 
34 µs, 50 µs, or 28 µs, depending on whether the network
uses 802.11a, b, or g. 802.11 acknowledgment packets,
however, can be transmitted after a shorter duration of
10 µs, called the Short Inter-Frame Spacing (SIFS).
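The 12 ms bound follows from simple air-time arithmetic. The sketch below reproduces it under the assumption that only the payload bits count (preamble and header overhead, which the text does not quantify, are ignored); the helper name `airtime_ms` is ours.

```python
def airtime_ms(payload_bytes: int, bitrate_mbps: float) -> float:
    """Time on air for the payload bits alone, in milliseconds."""
    bits = payload_bytes * 8
    microseconds = bits / bitrate_mbps  # bits / (Mbit/s) yields microseconds
    return microseconds / 1000.0

# Largest packet typically used by higher layers, sent at the lowest bit rate.
max_honest_ms = airtime_ms(1500, 1.0)
print(max_honest_ms)  # 12.0
```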


4 SECURITY MODEL 


TEP addresses the problem of authenticating key ex- 
change messages between two wireless devices, in the 
presence of an active adversary that may try to mount a 
man-in-the-middle attack. 


4.1 Threat Model 


The adversary can eavesdrop on all the signals on the chan- 
nel, including all prior communications. The adversary 
can also be active and transmit with an arbitrary power, at 
any time, thereby corrupting or overpowering other con- 
current transmissions. The adversary may know the TEP 
protocol, the precise times when devices transmit their 
announcements, and their exact locations. In addition, the 
adversary can know the exact channel between the pairing 
devices, and the channel from the pairing devices to the 
adversary. The adversary can also be anywhere in the 
network and is free to move. Multiple adversaries may 
exist in the network and can collude with each other. 
The adversary can have access to state-of-the-art RF 
technologies: he can have a multi-antenna system, he may 
be able to simultaneously receive and transmit signals, 
and he can use directional antennas to ensure that only 
one of the pairing devices can hear its transmissions. 
The adversary, however, does not have physical control 
over the pairing devices or their surroundings. Specifi- 
cally, the adversary cannot place either of the two devices 




Term                         Definition
Tamper-evident announcement  A wireless message whose presence and the integrity of its payload are guaranteed to be detected by every receiver within radio range (Figure 1).
Synchronization packet       An exceptionally long packet whose presence indicates a TEA. To detect a synchronization packet, it is sufficient to detect that the medium is continuously occupied for the duration of the synchronization packet, which is 19 ms.
Payload packet               The part of a TEA containing the data payload (e.g., a device public key).
ON-OFF slot                  The interval used to convey one bit from sender to receiver. The slot time is 40 µs. The bits in the slots are balanced, as described in §5.1.2.
Occupied/ON slot             A slot during which the medium is busy with a transmission.
Silent/OFF slot              A slot during which the medium is idle.
Sensing window               The interval over which the receiver collects aggregate information for whether the medium is occupied or silent.
Fractional occupancy         The fraction of time the medium was busy during a sensing window.

Table 1: Terminology used to describe TEP.


in a Faraday cage to shield all signals. We also assume 
that the adversary cannot break traditional cryptographic 
constructs, such as collision-resistant hash functions. 

Finally, we assume that the PBC buttons operate accord- 
ing to the PBC standard [26] and that the user performs 
the PBC pairing as prescribed in the standard, i.e., the 
user puts the two devices in range then pushes the buttons 
on the two devices within 120 seconds of each other. 


4.2 Security Guarantees 


Under the assumptions outlined above, TEA guarantees 
that an adversary cannot tamper with the payload of a 
TEA message, or mask the fact that a TEA message was 
transmitted. Building on the TEA mechanism, TEP guar- 
antees that in the absence of an active adversary, two 
pairing devices can establish secure pairing. In the pres- 
ence of an adversary who is actively mounting MITM 
attacks (or in the presence of more than two devices at- 
tempting to pair at the same time), TEP ensures that the 
pairing devices will signal an error and never be tricked 
into pairing with the adversary (or, more generally, with 
the wrong device). In other words, TEP provides the PBC 
security guarantees augmented with protection against 
MITM attacks. 


5 TEP DESIGN 


TEP’s design is based upon the TEA mechanism, a uni- 
directional announcement protocol that guarantees that 
adversaries cannot tamper with or mask TEA messages 
without detection. TEP uses TEA to exchange public 
keys between the PBC enrollee and registrar in a way that 
resists MITM attacks. At a high level, when an enrollee 
enters PBC mode, it sends out a TEA message containing 
its public key. When a registrar in PBC mode receives 
this message (or suspects that an adversary may have tried
to tamper with or mask such a message), it responds with 
its own public key. Both the enrollee and the registrar 
collect all TEA messages received during PBC’s walk 
time period. If, during that time, each received exactly 
one unique public key (and no tampered messages), they 
can conclude that this public key came from the other 
party, and can use it for pairing. Otherwise, PBC reports 
a session overlap error (e.g., because multiple enrollees 
or registrars were pairing at the same time, or because an 
adversary interfered), and asks the user to retry. 

The rest of this section describes our protocol in more 
detail, starting with the TEA mechanism, using terminol- 
ogy defined in Table 1. 


5.1 Tamper-Evident Announcement (TEA) 


The goal of TEA is to guarantee that if an attacker tampers 
with the payload of a TEA message, or tries to mask the 
fact that a message was transmitted at all, a TEA receiver 
within communication range will detect such tampering. 
In other words, TEA receivers will always detect when a 
TEA message was, or may have been, transmitted. 

To provide this guarantee, TEA messages have a spe- 
cialized structure, as shown in Figure 1. First, there is a 
synchronization packet, which protects the TEA’s trans- 
mission from being masked, by unambiguously indicating 
to a TEA receiver that a TEA message follows. The syn- 
chronization packet contains random data, to ensure that 
an adversary cannot cancel out its energy.¹

Second, the TEA message contains the announcement 
payload. The payload is always of fixed length, to ensure 
that an adversary cannot truncate or extend the payload in 
flight, but otherwise has no restrictions on its content or 
encoding. In our pairing protocol, the payload of a TEA 
message contains the sender’s Diffie-Hellman public key, 
along with other registration information. 

Third, the TEA message contains ON-OFF slots, which 
guarantee that any tampering with a TEA payload is de- 
tectable. Similar to the synchronization packet, the con- 
tent of the ON slots is randomized. The first two slots, as 
shown in Fig. 3, encode the direction flag, which defines 
whether this TEA message was sent by an enrollee (called 
a TEA request, flag value “10”) or by a registrar (called a
TEA reply, flag value “01”). The remaining slots contain
a cryptographic hash of the payload. While it is possible 
to also encode the payload using slots, it would be ineffi- 
cient for long payloads, and unnecessary, since protecting 
a cryptographic hash suffices. To detect tampering, TEA 
encodes all slots in a way that guarantees that exactly half 
of the slots are silent, as we describe in §5.1.2. 


¹ In practice, it is very hard to cancel a signal in flight, but in theory
an attacker that knows the transmitted signal and the channels to the 
receiver can construct a signal that cancels out the original signal at the 
receiver. Making the data random eliminates this option. 
Figure 3: Data encoded in the ON-OFF slots. The first two bits
specify the direction of the message, and the rest of the bits contain 
a cryptographic hash of the payload. 


5.1.1 Detecting tampering 


To determine if an adversary may have tampered with a 
TEA message, a TEA receiver performs several checks. 
First, the receiver continuously monitors the medium for 
possible synchronization packets. If it detects any burst 
of energy at least as long as the synchronization packet, 
it interprets it as the start of a TEA announcement. The 
receiver conservatively assumes that any such period of 
energy is a TEA message, and signals a missed message 
if it is unable to decode and verify the subsequent payload. 
To minimize false positives, we choose a synchronization 
packet that is longer than any regular contiguous WiFi 
transmission. An adversary cannot cancel out a legitimate 
synchronization packet because the adversary cannot elim- 
inate the power on the channel. In fact, since the payload 
of the synchronization packet is random, the adversary 
cannot cancel the power from the packet even if he knows 
the exact channel between Alice and Bob, and is fully 
synchronized with the transmitter. Thus, an adversary 
cannot tamper with the presence of a TEA message by 
masking it out. 

Second, once a TEA receiver detects the start of a TEA 
announcement, it attempts to decode the payload packet 
and the hash bits in the ON-OFF slots. If the receiver can- 
not decode the payload (i.e., the packet checksum fails), 
it indicates tampering. If the payload is decoded, the re- 
ceiver verifies that the hash bits match the hash of the 
payload, i.e., it verifies that hashing the payload produces
the same bits in the ON-OFF slots and that the number of 
ON slots is equal to that of OFF slots. If the receiver can- 
not verify the hash bits, it conservatively assumes that an 
adversary is tampering with the transmission. Once tam- 
pering is detected, the receiver signals a session overlap 
error (as in PBC), requiring the user to retry later. 
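The second set of checks can be sketched as a single decision function. This is a minimal illustration, not the driver code: the name `check_tea`, the string results, and the idea of passing in precomputed `expected_slots` (the slot pattern the receiver derives by hashing and encoding the decoded payload) are our assumptions.

```python
def check_tea(checksum_ok, rx_slots, expected_slots):
    """Return "OK" or "TAMPERED" for one received TEA message.

    checksum_ok:    whether the payload packet decoded correctly
    rx_slots:       sensed ON (1) / OFF (0) slots
    expected_slots: slots the receiver computes from the decoded payload
    """
    if not checksum_ok:
        return "TAMPERED"              # payload could not be decoded
    if 2 * sum(rx_slots) != len(rx_slots):
        return "TAMPERED"              # ON and OFF counts differ
    if list(rx_slots) != list(expected_slots):
        return "TAMPERED"              # slots do not match the payload hash
    return "OK"

expected = [1, 0, 1, 0, 0, 1]          # toy pattern: flag "10" plus four bits
print(check_tea(True, expected, expected))             # OK
print(check_tea(True, [1, 1, 1, 0, 0, 1], expected))   # TAMPERED
```

In either failure case the receiver would go on to signal a session overlap error, as described above.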


5.1.2 Balancing the ON-OFF Slots


An adversary can transform an OFF slot to an ON slot 
(by transmitting in it) but cannot transform an ON slot to 
an OFF slot. Hence, to ensure that the adversary cannot 
tamper with even a single OFF slot without being detected, 
we make the number of the OFF slots in a TEA message 
equal to that of the ON slots, i.e., we balance the slots. The 
number of slots is fixed by the TEP protocol, thus avoiding 
truncation or extension attacks. Since the direction flag is 
already encoded in two balanced bits, we now focus on 
balancing the rest of the slots. 
Our balancing algorithm takes the hash bits of the TEA
payload and produces a balanced bit sequence to be sent 
in the ON-OFF slots. One inefficient but simple trans- 
formation is to use Manchester encoding of the hash bits 
to produce a balanced output bit sequence with twice as 
many output bits. TEA, however, introduces an efficient 
encoding that takes an even number, N, of input bits and 
produces $M = N + 2\lceil \log N \rceil$ output bits which have an
equal number of zeros and ones. The details of our effi- 
cient encoding algorithm are presented in Appendix A. 
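To illustrate why balancing makes tampering evident, the sketch below uses the simple Manchester transformation mentioned above, not the efficient encoding of Appendix A. The function names are ours, and we assume the logarithm in the length formula is base 2, which is consistent with a 128-bit hash yielding 142 slots.

```python
import math

def manchester_encode(bits):
    """Map 1 -> (ON, OFF) and 0 -> (OFF, ON); the output is always balanced."""
    out = []
    for b in bits:
        out.extend((1, 0) if b else (0, 1))
    return out

def is_balanced(slots):
    """True if the number of ON slots equals the number of OFF slots."""
    return 2 * sum(slots) == len(slots)

def efficient_length(n_bits):
    """Output length N + 2*ceil(log2 N) of the efficient encoding (length only)."""
    return n_bits + 2 * math.ceil(math.log2(n_bits))

slots = manchester_encode([1, 0, 1, 1, 0, 0, 1, 0])
assert is_balanced(slots)

# An adversary can only add energy, i.e., turn some OFF slot into an ON slot.
tampered = list(slots)
tampered[tampered.index(0)] = 1
assert not is_balanced(tampered)

print(len(manchester_encode([0] * 128)), efficient_length(128))  # 256 142
```

Because the slot count is fixed and energy can only be added, any tampering leaves more ON slots than OFF slots, which the receiver detects.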


5.1.3 Interoperating with 802.11 


To interoperate with other 802.11 devices that may not be 
TEA-aware, the ON-OFF slots are preceded by a CTS- 
to-SELF packet, which reserves the medium for the TEA 
message. This serves two purposes. First, since the sender 
does not transmit during the OFF slots, another 802.11 
node could sense the wireless medium to be idle for more 
than a DIFS period, and start transmitting its own packet 
during that OFF slot. The 802.11 standard requires 802.11 
nodes that hear a CTS-to-SELF on the channel to abstain 
from transmitting for the period mentioned in that packet, 
which will ensure that no legitimate transmission overlaps 
with the slots. Second, in case of a TEA message from an 
enrollee to a potential registrar, the CTS-to-SELF packet 
reserves the medium so that the registrar can immediately 
reply with its own TEA message. This prevents legitimate
nodes from hogging the medium and delaying the
registrar’s response. However, reserving the channel for 
the entire length of a TEA message is inefficient, if no 
registrar is present. To avoid under-utilization of the wire- 
less medium, the enrollee’s CTS-to-SELF only reserves 
the channel for a DIFS period past its slot transmissions. 
If a PBC-activated registrar is present, it must start trans- 
mitting its response message within the DIFS period. On 
the other hand, if there is no registrar, other legitimate 
devices will resume transmissions promptly. 

To maximize the probability that all devices can decode 
the CTS-to-SELF, it is transmitted at the most robust bit 
rate of 1 Mbps. Current 802.11 implementations obey a 
CTS-to-SELF that reserves the channel up to 32 ms. Our 
TEA message requires 144 slots,² and the slot duration is
40 µs (§6). This translates to about 5.8 ms, which is less
than the 32 ms allowed by the CTS-to-SELF.
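The slot budget can be checked with a few lines of arithmetic. The constants come from the text; the function name `slot_airtime_ms` is ours.

```python
SLOT_US = 40                 # slot duration
DIRECTION_SLOTS = 2          # direction flag
HASH_SLOTS = 142             # bit-balanced encoding of a 128-bit hash
CTS_TO_SELF_LIMIT_MS = 32    # reservation honored by current implementations

def slot_airtime_ms(total_slots, slot_us=SLOT_US):
    """Total duration of the ON-OFF slots in milliseconds."""
    return total_slots * slot_us / 1000

total_slots = DIRECTION_SLOTS + HASH_SLOTS
print(total_slots, slot_airtime_ms(total_slots))  # 144 5.76
assert slot_airtime_ms(total_slots) < CTS_TO_SELF_LIMIT_MS
```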

Finally, as shown in Figure 1, there is a gap between 
the synchronization and payload packets. If this gap is 
large, other 802.11 nodes would sense an idle wireless 
medium, and start transmitting, thus appearing to tamper 
with the TEA. To avoid this, we exploit the fact that 


² Two of the slots are for the direction flag, and the remaining 142
are for the bit-balanced hash bits. More specifically, the bit balancing
algorithm, in §5.1.2, takes N input bits and outputs $N + 2\lceil \log N \rceil$ bits.
Since the hash is a 128-bit function, the bit balancing algorithm produces
142 bit-balanced hash bits.
802.11 nodes are only allowed to transmit if they find
the medium continuously idle for a DIFS. Thus, a TEA 
sender sends the payload packet immediately after the 
synchronization packet with a gap of a Short Interframe 
Space (SIFS), which is much less than DIFS. 


5.1.4 API Summary 


The interface provided by TEA is as follows. For the 
sender side, there is a single blocking function, 


• void TEA_SEND (bool dir, str msg, time t),


which sends an announcement containing payload msg. 
The dir flag specifies the direction of the message, that 
is, whether it is a request message (from the enrollee) 
or a reply message (from the registrar). Time t specifies
the deadline by which the message must start transmis- 
sion. The TEA sender tries to respect carrier-sense in 
the medium access control (MAC) protocol, and waits 
until the medium is idle before transmitting its message. 
However, if the message cannot be transmitted by time t
(e.g., because an adversary is hogging the medium), the 
sender overrides the MAC’s carrier-sense, and transmits 
the announcement anyway, so that recipients will detect 
tampering. Note that the CTS-to-SELF requires honest 
nodes to release the medium for the registrar to transmit 
its own TEA reply. 
For the receiver side, TEA provides two functions, 


• handle TEA_RECV_START (bool dir), and
• msg_list TEA_RECV_GET (handle h).


The first function, TEA_RECV_START, starts listening
on the wireless medium for TEA messages that are ei- 
ther requests (from an enrollee) or replies (from a reg- 
istrar), based on the dir flag. The second function, 
TEA_RECV_GET, is used to retrieve the set of messages
accumulated by the receiver since TEA_RECV_START or
TEA_RECV_GET was last invoked. If TEA_RECV_GET
could not decode a possible TEA message (or verify that it
was not tampered with), it returns a special value RETRY,
which causes the caller (i.e., TEP) to re-run its protocol.
As an optimization, if all of the TEA messages that
TEA_RECV_GET was unable to decode were overlapping
with the receiver’s own transmissions (i.e., a concurrent
TEA_SEND), TEA_RECV_GET returns a special value
OVERLAP instead of RETRY. We describe in §6.4 how
a node detects TEA messages that overlap with its own 
transmissions, and in Appendix B how we use the overlap 
information to optimize wireless medium utilization. 
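The return-value rule for the receive side can be summarized in a short sketch. This is an illustration of the rule as stated above, not the actual interface: the `Observation` record, its fields, and the function name `classify_received` are hypothetical, and we assume that any undecodable message causes a special value to be returned instead of the message list.

```python
from dataclasses import dataclass
from typing import List, Optional, Union

RETRY = "RETRY"
OVERLAP = "OVERLAP"

@dataclass
class Observation:
    payload: Optional[bytes]          # None if decoding or verification failed
    overlapped_own_tx: bool = False   # coincided with our own transmission

def classify_received(observations: List[Observation]) -> Union[List[bytes], str]:
    """Return the decoded payloads, or OVERLAP / RETRY if any message was missed."""
    failed = [o for o in observations if o.payload is None]
    if not failed:
        return [o.payload for o in observations]
    if all(o.overlapped_own_tx for o in failed):
        return OVERLAP                # every miss coincided with our own send
    return RETRY

print(classify_received([Observation(b"key-A")]))        # [b'key-A']
print(classify_received([Observation(None, True)]))      # OVERLAP
print(classify_received([Observation(None, False)]))     # RETRY
```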


5.2 Securing PBC using TEA 


Using the TEA mechanism, we will now describe how 
TEP—a modified version of the PBC protocol—avoids 
man-in-the-middle attacks. 
Once the button is pressed on the enrollee, the enrollee
repeatedly scans the 802.11 channels in a round robin 
manner, as in the current PBC protocol. On each channel, 
the enrollee transmits a TEA request, i.e., a TEA message 
with the direction flag set to “10”. The TEA request con- 
tains the enrollee’s public key (and any PBC information 
included in an enrollee’s probe). If an adversary continu- 
ously occupies the medium for tx_tmo (e.g., 1 second), the 
enrollee overrides carrier-sense and transmits its message 
anyway. The enrollee then waits for a TEA response from 
a registrar, which is required to immediately respond. The 
enrollee records the responses, if any, and after a speci- 
fied period on each channel it moves to the next 802.11 
channel and repeats the process. The enrollee continues 
to cycle through all 802.11 channels for PBC’s walk time 
period. The enrollee’s logic corresponds to the following 
pseudo-code to build up r, the set of registrar responses: 

r ← ∅
for 120 sec + #channels × (tx_tmo + 2 × tea_duration) do    ▷ walk time + max enrollee scan period
    switch to next 802.11 channel
    h ← TEA_RECV_START (reply)
    TEA_SEND (request, enroll_info, now + tx_tmo)
    SLEEP (tea_duration)
    r ← r ∪ TEA_RECV_GET (h)
end for


A registrar follows a similar protocol. Once the PBC 
button is pressed, the registrar starts listening for possible 
TEA requests on its 802.11 channel. Every time a TEA 
message is received, the registrar records the message 
payload, and immediately sends its own TEA message in 
response, containing the registrar’s public key. It is safe 
to reply immediately because the sender’s TEA message 
ended with a CTS-to-SELF, which reserved the medium 
for the registrar’s reply. The registrar’s pseudo-code to 
build up e, the set of enrollee messages, is as follows: 

e ← ∅
h ← TEA_RECV_START (request)
for 120 sec + #channels × (tx_tmo + 2 × tea_duration) do    ▷ walk time + max enrollee scan period
    m ← TEA_RECV_GET (h)
    if m ≠ ∅ then    ▷ enrollee, RETRY, or OVERLAP
        e ← e ∪ m
        TEA_SEND (reply, registrar_info, now)    ▷ send reply immediately
    end if
end for

After the PBC’s walk time expires, both the enrollee 
and the registrar check the list of received messages. Suc- 
cessful pairing requires that both the enrollee and the 
registrar receive exactly one unique public key via TEA 
messages, and that no messages were tampered with (i.e.,
TEA_RECV_GET never returned RETRY or OVERLAP). If
exactly one public key was received, it must have been the 
public key of the other party, and TEP can safely proceed
with pairing. If more than one public key was received, 
or RETRY or OVERLAP was returned, then a session over- 
lap error is raised, indicating that more than one pair of 
devices may be attempting to pair, or that an adversary is 
mounting an attack. In this situation, the user must retry 
pairing. 
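The end-of-walk-time decision can be sketched as follows. The function name and the outcome labels are ours; in particular, the text does not name the case in which no public key at all was received, so the `"NO_PEER"` label is an assumption made for illustration.

```python
RETRY, OVERLAP = "RETRY", "OVERLAP"

def pairing_outcome(results):
    """Decide the pairing outcome once the walk time expires.

    results: values collected during the walk time; each item is either a
             list of received public keys or the special value RETRY/OVERLAP.
    """
    keys = set()
    for item in results:
        if item == RETRY or item == OVERLAP:
            return ("SESSION_OVERLAP", None)   # possible tampering
        keys.update(item)
    if len(keys) > 1:
        return ("SESSION_OVERLAP", None)       # more than one candidate peer
    if len(keys) == 1:
        return ("PAIR", next(iter(keys)))      # exactly one unique public key
    return ("NO_PEER", None)                   # assumption: nothing received

print(pairing_outcome([[b"K1"], [], [b"K1"]]))   # ('PAIR', b'K1')
print(pairing_outcome([[b"K1"], [b"K2"]]))       # ('SESSION_OVERLAP', None)
```

Note that the same key received several times (e.g., once per scan cycle) still counts as one unique key.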


5.2.1 Reducing Medium Occupancy 


The protocol described above is correct and secure (as 
we will prove in §7.1). However, it can be inefficient if
somehow multiple registrars transmit overlapping replies 
at almost the same time. Each of them will then assume 
it may have missed a request from some enrollee (since it 
sensed a concurrent TEA message), and each will re-send 
its reply. This cycle may continue for the walk time of 120 
seconds, unnecessarily occupying the wireless medium. 
In Appendix B, we describe an optimization that avoids 
this situation and we prove that the optimized protocol 
maintains the same security guarantees. 


5.3 Example scenarios


Figure 4 shows how TEP works in five potential scenar- 
ios. In scenario (a), there is no attacker. In this case, 
the enrollee sends a request to which the registrar replies 
immediately. The two devices can thus proceed to com- 
plete pairing after 120 seconds. In scenario (b), the en- 
rollee transmits its request, but the attacker immediately 
jams it so that the registrar cannot decode the enrollee’s
request. However, the registrar detects a long burst of 
energy, which the registrar interprets as a TEA announce- 
ment, causing it to reply to the enrollee. 

In scenario (c), the enrollee sends the request; the at- 
tacker then captures the medium at the same time as the 
registrar, and transmits a reply, at a high power, imperson- 
ating the registrar. Because of capture effect, the enrollee 
decodes the message payload from the attacker. But since 
the registrar and the attacker transmit the hash function 
of different messages in the ON-OFF slots, the enrollee 
notes that the slots do not have equal number of zeros and 
ones and hence detects tampering with the announcement. 

In scenario (d), the adversary sends a request message 
in an attempt to gain access to the registrar; as stipulated 
by TEP, the registrar replies to this request. However, 
since the registrar waits for 120 seconds before complet- 
ing the pairing, it also hears the request from the enrollee. 
Since the registrar receives requests from two devices, it 
raises a session overlap error. 

Finally, in scenario (e), the adversary sends a TEA re- 
quest, receives the registrar’s reply, and then continuously 
jams the enrollee using a directional antenna. By using 
a directional antenna, the adversary ensures that the reg- 
istrar does not detect the jamming signal and hence does 
not interpret it as an invalid TEA. The enrollee carrier- 
Figure 4: Timelines of five example runs of the TEP protocol.


senses, detects that the medium is occupied, and does not 
transmit until it times out after tx_tmo seconds, at which 
point it ignores carrier sense and transmits its TEA re- 
quest. The registrar listens to this request message and 
detects the presence of the enrollee. Since the registrar 
receives requests from two devices, it raises a session 
overlap error. 


5.4 Making Pairing Faster 


The extension of PBC to use TEA, described above, re- 
quires the enrollee and registrar to wait for 120 seconds 
before completing the association process. If the enrollee 
does not wait for a full 120 seconds, and simply picks 
the first responding registrar, it may pick an adversary’s 
registrar—a legitimate registrar only replies when its PBC 
button has been pushed, and the user might push the reg- 
istrar’s PBC button slightly later than the enrollee’s. Be- 
cause the enrollee does not know if the user has already 
pushed the registrar’s button, it has to wait for 120 sec- 
onds to be sure that the user has pushed the button. In this 
section, we describe how one can eliminate this delay. 

First, if the user always pushes the enrollee’s button 
before the registrar’s button, then the registrar does not 
need to wait for 120 seconds; the registrar needs to wait 
for just the time it takes an enrollee to cycle through all 
of 802.11’s channels (which is less than 12s). Second, we 
can also eliminate the enrollee’s wait time. Specifically, 
if the user explicitly tells the enrollee that the registrar’s 
button was pushed, the enrollee can complete the associa- 
tion process after one cycle through the 802.11 channels, 
eliminating the additional wait time. 

For example, one approach would be to have the user 
first press the button on the enrollee, then press the button 
on the registrar, and then again push the button on the
enrollee. Note that, in this approach, the registrar does 
not have to wait for 120 seconds: because the registrar’s 
button is always pushed after the enrollee, the registrar 
knows that the enrollee is active, and is guaranteed to see 
the enrollee’s TEA message within the time required for 
the enrollee to cycle through all 802.11 channels. (Of 
course, if the 120 second period expires on the enrollee 
without any additional button pushes, the enrollee can pro- 
ceed to completion as before, with 2 total button pushes 
from the user.) 


6 TEA ON OFF-THE-SHELF HARDWARE 


We implement TEA on Atheros AR5001X+ chipsets by
modifying the ath5k driver, and running TEA’s timing- 
sensitive code in a kernel driver. 


6.1 Scheduling Slot Transmission 


To reduce the air time of a TEA, we must minimize the 
size of a single slot packet in the ON-OFF slots. Since 
the slot packet’s payload need not be decoded (just the 
presence or absence of a slot packet conveys a 1 or 0 bit),
we transmit slot packets at the highest bitrate, 54 Mbps,
for a total of 40 µs.

In addition to reducing the size of a slot packet, TEA 
must transmit slot packets at precise slot boundaries. 
Queueing in the kernel and carrier-sense in the card make 
precise transmission timing challenging. We avoid ker- 
nel queueing by implementing TEA in a kernel driver 
and using high-resolution timers. We avoid delays in the 
wireless card itself through several changes to the card 
firmware and driver, as follows. For the duration of the 
slots, we disable binary exponential backoff (802.11 BEB) 
by setting CW_min and CW_max to 1. To prevent carrier-sense
backoff, we disable automatic noise calibration by
setting the noise floor register to “high”. We place slot 
packets in the high-priority queue. Finally, we disable the 
transmitter’s own beacons by disabling the beacon queue. 
In aggregate, these changes allow us to make slot packets 
as short as 40 µs and maintain accurate slot timing.


6.2 Energy Detection at the Receiver 


A TEA receiver detects a synchronization packet and dis- 
tinguishes ON from OFF slots by checking the energy 
level on the medium. Hence, the receiver needs to dis- 
tinguish the noise level, which is around -90dB, from an 
actual transmission. To do this, we set the noise floor 
to -90dB and deactivate auto-calibration while running 
TEP. 

While an ideal receiver would detect energy at the finest 
resolution (i.e., every signal sample), existing wireless 
chipsets do not give access to these samples. Instead, 
we exploit two registers provided by the ath5k firmware:
AR5K_PROFCNT_CYCLE and AR5K_PROFCNT_RXCLR.
The first register is incremented every clock cycle based
on the clock on the wireless hardware. The second register,
on the other hand, is incremented only if the hardware finds
high energy during that clock cycle.

Using these registers, we define a sensing window (SW) 
as the interval over which the receiver collects aggregate 
information for whether the medium is occupied or silent, 
as defined in Table 1. At the beginning of a SW, a TEA re- 
ceiver resets both registers to 0, and reads them at the end 
of the SW. The ratio of these two registers at the end of
the SW, $\frac{\mathrm{AR5K\_PROFCNT\_RXCLR}}{\mathrm{AR5K\_PROFCNT\_CYCLE}}$, is defined as the fractional
occupancy. By putting a threshold on the fractional occupancy,
a TEA receiver can detect whether the medium is
occupied in a particular SW, and hence can detect energy 
bursts and measure their durations in units of the sensing 
window. Similar to the sender, a TEA receiver runs in the 
kernel to precisely schedule sensing windows. 

Our implementation dynamically adjusts the length of 
the sensing window to minimize system overhead. The 
TEA receiver uses a long sensing window of 2 ms, un- 
til it detects a burst of energy longer than 17 ms. This 
indicates a synchronization packet, at which point the 
receiver switches to a 20 µs sensing window to accurately
measure energy during slots, providing on average two 
sensing window measurements for every slot. 
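The following sketch shows how the two counters could be turned into a synchronization-packet detector. It is illustrative only: the function names, the 0.5 occupancy threshold, and the counter values in the example are assumptions, as the text does not state them; only the 2 ms window and the 17 ms burst length come from the description above.

```python
def fractional_occupancy(rxclr, cycle):
    """Fraction of clock cycles in one sensing window that saw high energy."""
    return rxclr / cycle if cycle else 0.0

def detect_sync(windows, window_ms=2.0, min_burst_ms=17.0, threshold=0.5):
    """Find a continuous energy burst longer than min_burst_ms.

    windows: (rxclr, cycle) counter pairs, one per sensing window.
    Returns the index of the window at which the burst first exceeds
    min_burst_ms, or None if no such burst occurs.
    """
    run = 0
    for i, (rxclr, cycle) in enumerate(windows):
        busy = fractional_occupancy(rxclr, cycle) >= threshold
        run = run + 1 if busy else 0          # length of the current busy run
        if run * window_ms > min_burst_ms:
            return i
    return None

BUSY, IDLE = (950, 1000), (20, 1000)          # made-up counter readings
print(detect_sync([BUSY] * 6 + [IDLE] * 3))   # None: a 12 ms packet is too short
print(detect_sync([IDLE] * 2 + [BUSY] * 10))  # 10: ninth consecutive busy window
```

On detection, a receiver following the description above would switch to the short sensing window for the slots.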


³ There is a tradeoff between the noise floor and the permissible
distance between the pairing devices. In particular, pairing devices 
separated by large distances have a weak signal and hence, to ensure 
detection, the noise floor should be set to a low value. On the other 
hand, pairing devices that are closer have a stronger signal, and hence 
the noise floor can be set to a higher value. We pick -90dB because it is 
the default noise floor value in typical WiFi implementations. Manufac- 
turers, however, can pick a higher default value, as long as the pairing 
devices are placed closer to each other. 
The receiver must be careful to ensure that a 20 µs sensing
window allows accurate detection of slot occupancy.
But, because the sender and receiver are not synchronized, 
sensing windows may not be aligned with slots, and in 
the worst case, will be off by half a sensing window, i.e., 
10 µs. However, having a sensing window that is half
the length of a slot ensures that at least one of every two 
sensing windows is completely within a slot (i.e., does not 
cross a slot boundary). Thus, to measure slot occupancy, 
the receiver compares the variance of the odd-numbered sensing
window measurements with that of the even-numbered sensing
window measurements, and uses the set with the higher
variance. Because the slots are bit-balanced, the correctly aligned
sequence has an equal number of ones and zeros, and
therefore the higher variance.
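A small numerical example makes the variance argument concrete. In this sketch the occupancy values are constructed by hand for six balanced slots sensed with windows that start a quarter slot late; the function names and the 0.5 threshold are our assumptions.

```python
from statistics import pvariance

def pick_aligned_windows(occupancy):
    """Pick the even- or odd-indexed measurements, whichever varies more.

    occupancy: fractional occupancy per sensing window, two windows per slot.
    """
    even, odd = occupancy[0::2], occupancy[1::2]
    return even if pvariance(even) >= pvariance(odd) else odd

def to_bits(occupancy, threshold=0.5):
    """Threshold fractional occupancies into ON (1) / OFF (0) slot values."""
    return [1 if x >= threshold else 0 for x in occupancy]

# Slots 1,0,0,1,1,0: even-indexed windows lie fully inside a slot, while
# odd-indexed windows straddle slot boundaries and see mixed occupancy.
occupancy = [1.0, 0.5, 0.0, 0.0, 0.0, 0.5, 1.0, 1.0, 1.0, 0.5, 0.0, 0.0]
print(to_bits(pick_aligned_windows(occupancy)))  # [1, 0, 0, 1, 1, 0]
```

Here the even-indexed set has variance 0.25, the maximum for values in [0, 1], whereas the straddling set is pulled toward intermediate values and has a lower variance.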


This technique for measuring slot occupancy is secure 
in the presence of an adversary. As we will prove in 
Proposition 7.1, an adversary can introduce energy, but 
cannot cancel energy in an occupied slot. Thus, the adversary
can only increase, but cannot reduce, the computed
occupancy ratios in either the odd or the even windows. 
As a result, the adversary cannot create a different bit 
sequence in either the odd or even windows which still 
has an equal number of ones and zeros. Thus, sampling 
at twice the slot rate maintains TEA’s security guarantees. 


6.3 Sending A Synchronization Packet 


To transmit a long synchronization packet, TEA trans- 
mits the maximum-sized packet allowed by our hardware 
(2400 bytes) at the lowest bit rate (1 Mbps), resulting in a 
19 ms synchronization packet. While many receivers drop 
such long packets (the maximum packet size permissible 
by the higher layers is 1500 bytes), this does not affect a 
TEA receiver, since it does not need to decode the packet; 
it only needs to detect a long burst of energy. 


6.4 Checking for TEA While Transmitting 


While executing the TEP protocol (which lasts for 120 
seconds), a node must detect TEA messages transmitted 
by other nodes even if they overlap with its own trans- 
missions. We distinguish two cases: First, when the node 
transmits a standard 802.11 packet, it conservatively as- 
sumes that the channel has been occupied by part of a 
synchronization packet for the duration of its transmis- 
sion. The node samples the medium before and after its 
transmission, checking for continuous occupancy by a 
synchronization packet. As our evaluation shows (87.3), 
the longest packets in operational WiFi networks are about 
4 ms (a collision of two packets sent at the lowest 802.11¢ 
rate of 6 Mb/s), making synchronization packet false pos- 
itives unlikely even with the conservative assumption that 
the entire 4 ms transmission overlapped with part of a 
synchronization packet (19 ms).⁴

Second, a node that is transmitting a TEA request must 
not miss a concurrently transmitted TEA reply, and simi- 
larly a node that is transmitting a reply must not miss a 
concurrent request. To detect partially-overlapping TEA 
messages, a node samples the medium before and after 
every synchronization packet, and after the slots of every 
TEA message, and if it detects energy, it assumes that it 
may have missed an overlapping TEA message (and thus, 
TEA_RECV_GET will return OVERLAP, unless it observes 
other possibly-missed messages, in which case it will re- 
turn RETRY.) Since the total length of the ON-OFF slots 
is shorter than the length of the synchronization packet, 
sampling the medium after the end of a synchronization 
packet (i.e., before the start of the payload and slots) and 
after the end of the slots suffices to detect an overlapping 
synchronization packet. Finally, in the case when two 
TEA messages are perfectly synchronized, the node uses 
the direction bits to detect a collision. Since the direction 
flag for a request is “10” and a reply “01”, the node checks 
for this scenario by checking the energy level during the 
OFF slot in the direction field in its own transmission. If 
the OFF slot shows a high energy level, TEA_RECV_GET 
will return OVERLAP (or RETRY, if there are other missed 
messages). 
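
The checks described in this section can be summarized as a small decision routine. This is a minimal sketch of one reading of the text, not the prototype's code: the names `RecvStatus`, `overlap_suspected`, and `recv_status` are hypothetical, the `OK` value is an assumed placeholder for the no-problem case, and the rule "one possibly-missed message gives OVERLAP, more than one gives RETRY" is an interpretation of the paragraph above.

```python
from enum import Enum


class RecvStatus(Enum):
    OK = "OK"            # hypothetical: nothing suspicious was observed
    OVERLAP = "OVERLAP"
    RETRY = "RETRY"


def overlap_suspected(busy_before_sync: bool,
                      busy_after_sync: bool,
                      busy_after_slots: bool,
                      busy_in_direction_off_slot: bool) -> bool:
    """True if any of the medium samples taken around the node's own TEA saw energy."""
    return (busy_before_sync or busy_after_sync
            or busy_after_slots or busy_in_direction_off_slot)


def recv_status(suspected: bool, other_missed_messages: int) -> RecvStatus:
    """Map the number of possibly-missed messages to a status (assumed rule)."""
    possibly_missed = int(suspected) + other_missed_messages
    if possibly_missed == 0:
        return RecvStatus.OK
    if possibly_missed == 1:
        return RecvStatus.OVERLAP
    return RecvStatus.RETRY
```

For example, energy observed only in the OFF slot of the node's own direction field, with no other missed messages, maps to OVERLAP in this sketch.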


7 EVALUATION 


We evaluate TEP along three axes: security, accuracy, and 
performance. Our findings are as follows: 


• TEP is provably secure against MITM attacks.

• TEP can be accurately realized using existing OS and
802.11 hardware. Specifically, our prototype sender can
schedule ON-OFF slots at a resolution of 40 µs, and
its 95th percentile scheduling error is as low as 1.65 µs.
Our prototype receiver can sense the medium’s occupancy
over periods as small as 20 µs and can distinguish
ON slots from OFF slots with a zero error rate.

• Results from two operational networks, our campus
network and SIGCOMM 2010, show that TEP never
confuses cross traffic for an attack. Further, even in
the presence of Bluetooth devices which do not obey
CTS-to-SELF and may transmit during TEP’s OFF
slots, TEP can perform key exchange in 1.4 attempts,
on average.


7.1 Evaluating TEP’s Security 


We analyze TEP’s security using the threat model in §4.1.
To do so, we formally state our definitions, then prove
that a TEA is tamper evident and that wireless pairing
using TEP is secure against MITM attacks.


⁴ Note that even if some networks have normal packets that are much
longer than 4 ms, this may create false positives but does not affect the
security of the protocol.

Definition Tamper evident: A message is said to be tamper
evident if an adversary can neither change the message’s
content without being detected nor hide the fact
that the message has been transmitted.


Before we proceed to prove that a TEA is tamper ev- 
ident we first prove the following proposition about the 
capability of an adversary. 


Proposition 7.1 Let s(t) be the transmitted signal, and 
h(t) be the channel impulse function. Assuming the trans- 
mitted signal is unpredictable, and the receiver is within 
radio range of the sender, an adversary cannot cancel the 
signal energy at the receiver even if he knows the channel 
function between the sender and receiver, h(t). 


Proof The received signal is a convolution of the transmitted
signal and the channel impulse function, plus the
adversary’s signal $a(t)$, plus white Gaussian noise $n(t)$,
i.e., $r(t) = h(t) * s(t) + a(t) + n(t)$. To cancel the received
energy, the adversary needs to produce a signal $a(t)$ so
that $r(t) \approx n(t)$, or equivalently, $h(t) * s(t) + a(t) \ll n(t)$.
Since the receiver is within radio range of the sender,
we know $h(t) * s(t) \gg n(t)$, and, since $n(t)$ is physically
unpredictable, that $a(t) \approx -h(t) * s(t)$. But an adversary
that can compute such an $a(t)$ directly contradicts our
assumption that $s(t)$ is unpredictable, and thus an adversary
cannot compute such an $a(t)$.


Since the synchronization packet and ON slots have 
random contents, Prop. 7.1 implies that an adversary can- 
not hide the channel energy during the transmission of the 
synchronization packet or the ON slots from a receiver. 
Based on this result we proceed to prove the following: 


Proposition 7.2 Given the transmitter and receiver are 
within range, and the receiver is sensing the medium, a 
TEA, described in §5.1, is tamper evident.


Proof We prove Prop. 7.2 by contradiction. Assume that 
one party, Alice, sends a TEA to a second party, Bob. Sup- 
pose that Alice’s TEA to Bob fails to be tamper-evident. 
This can happen because the adversary succeeds either in 
hiding from Bob that Alice sent a TEA, or in changing 
the TEA content without being detected by Bob. To hide 
Alice’s TEA, the adversary must convince Bob that no 
synchronization packet was transmitted. This requires 
the adversary to cancel the energy of the synchronization 
packet at Bob, which contradicts Prop. 7.1. Thus, the 
adversary must have changed the announcement content. 

Suppose the adversary changed the data encoded in the
slots. Prop. 7.1 says that the adversary cannot cancel the
energy in an ON slot, and hence cannot change an ON
slot to an OFF slot. Since the number of ON and OFF
slots is balanced, the adversary cannot change the slots
without increasing the number of ON slots, and thus being
detected. Thus, the only alternative is that the adversary 
must have changed the message packet. Since the ON- 
OFF slots include a cryptographic hash of the message, 
this means that the adversary constructed a different mes- 
sage packet with the same hash as the original message 
packet. This contradicts our assumption that the hash is 
collision-resistant. Thus, the adversary cannot alter the 
announcement content, and TEA is tamper-evident. 
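
The balance argument in this proof can be checked exhaustively on a small example. The sketch below is illustrative: it models slots as a list of 0 (OFF) and 1 (ON), and the helper names are hypothetical. It assumes, as Prop. 7.1 states, that the adversary can only add energy, i.e., turn OFF slots into ON slots.

```python
from itertools import combinations


def is_balanced(slots):
    """A valid slot sequence has exactly as many ON (1) slots as OFF (0) slots."""
    return 2 * sum(slots) == len(slots)


def add_energy(slots, positions):
    """Model the adversary: every slot index in `positions` is forced to ON."""
    return [1 if i in positions else s for i, s in enumerate(slots)]


def undetected_tamperings(slots):
    """Return every add-energy modification that still looks balanced."""
    off = [i for i, s in enumerate(slots) if s == 0]
    found = []
    for r in range(1, len(off) + 1):
        for positions in combinations(off, r):
            tampered = add_energy(slots, set(positions))
            if is_balanced(tampered):
                found.append(tampered)
    return found
```

Calling `undetected_tamperings([0, 1, 1, 0, 1, 0, 0, 1])` yields an empty list: every modification that only adds energy raises the ON count above half and is therefore detectable.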


Although Prop. 7.2 guarantees that a TEA message is 
tamper-evident if the receiver is sensing the medium, the 
receiver may be transmitting its own message at the same 
time. We now prove that a TEA is tamper-evident even if 
the receiver transmits its own messages. 


Proposition 7.3 Given a receiver (Bob) that can send 
its own messages, a TEA sent by a transmitter (Alice) in 
range of the receiver is tamper-evident, if the receiver 
follows the concurrent-transmission protocol of §6.4, and 
the receiver and transmitter send TEA messages with 
different directions (request or reply). 


Proof If Bob detects the synchronization packet (SP) 
of Alice’s TEA, the TEA is tamper-evident: either Bob 
will refrain from sending during that TEA, in which case 
Prop. 7.2 applies, or Bob will transmit concurrently, and 
TEA_RECV_GET will return RETRY or OVERLAP.

If Bob fails to detect Alice’s SP, it must have happened
while Bob was sending his own message (otherwise,
Prop. 7.2 applies). Since regular 802.11 packets
are shorter than an SP, and §6.4 conservatively assumes
the medium was occupied for the entire duration of the
transmitted packet, Bob could not have missed an SP while
sending a regular packet. Thus, the only remaining option
is that Alice’s SP overlapped with a TEA sent by Bob.

Consider four cases for when Alice’s SP was sent in relation
to the SP of Bob’s TEA. First, if Alice’s SP started
before Bob’s SP, Bob would detect energy before starting
to transmit his SP and return OVERLAP or RETRY (§6.4),
making the TEA tamper-evident. Second, if Alice’s SP
started exactly at the same time as Bob’s SP, Bob would
detect energy during the direction bits and return OVERLAP
or RETRY (§6.4), making the TEA tamper-evident.
Third, if Alice’s SP started during Bob’s SP, Bob would
detect energy after his SP and return OVERLAP or RETRY
(§6.4), making the TEA tamper-evident. Fourth, if Alice’s
SP started after Bob’s SP ended, Bob would detect
energy from Alice’s SP after the end of his TEA slots
and return OVERLAP or RETRY (§6.4), making the TEA
tamper-evident. Thus, in all cases, the TEA is
tamper-evident.


We now prove TEP is secure against a MITM attack.

Proposition 7.4 Suppose an enrollee and a registrar are
within range, both are following the TEP protocol as
described in §5.2 and the user does the stipulated actions
required by PBC. Under the threat model defined in §4.1,
an adversary cannot convince either the enrollee or the
registrar to accept any public key that is not the legitimate
public key of the other device.


Proof We prove Prop. 7.4 by contradiction, considering 
first the registrar, and then the enrollee. 

First, suppose an adversary convinces the registrar to 
accept a public key other than that of the enrollee. By
§5.2, this means the registrar received exactly one public
key (and, thus, did not receive the enrollee’s key), and
TEA_RECV_GET never returned OVERLAP or RETRY. By
assumption, the enrollee and registrar entered PBC mode
within 120 seconds of each other, which means they were
concurrently running their respective pseudo-code for
at least #channels × (tx_tmo + 2 × tea_duration) seconds,
and therefore the enrollee must have transmitted at least 
one TEA message on the registrar’s channel while the reg- 
istrar was listening. Prop. 7.3 guarantees that the registrar 
must have either received that one message, or detected 
tampering (and returned OVERLAP or RETRY), which con- 
tradicts our assumption that the registrar never received 
the enrollee’s message and never returned OVERLAP or 
RETRY. Thus, an adversary cannot convince the registrar 
to accept a public key other than that of the enrollee. 

Second, suppose an adversary convinces the enrollee to 
accept a public key other than that of the registrar. By §5.2,
this means that the enrollee received exactly one public
key response to its requests (and, thus, did not receive
the registrar’s key), and TEA_RECV_GET never returned
OVERLAP or RETRY. As above, there must have been a 
time when the registrar was listening, and the enrollee 
transmitted its request message on the registrar’s channel. 
Prop. 7.3 guarantees that the registrar must have either 
received the enrollee’s message, or detected tampering 
(and returned OVERLAP or RETRY). In both of those 
cases, §5.2 requires the registrar to send a reply. Prop. 7.3
similarly guarantees that the enrollee must have either 
received the registrar’s reply, or detected tampering (and 
returned OVERLAP or RETRY), which directly contradicts 
our supposition. Thus, an adversary cannot convince the 
enrollee to accept a public key other than the registrar’s, 
and TEP is secure. 


7.2 Evaluating TEP’s Accuracy 


We check whether TEP can be accurately realized using
existing operating systems and off-the-shelf 802.11
hardware. Our experiments use our Ath5k prototype described
in §6 and run over our campus network. Figure 5
shows the locations of the TEP nodes, which span an area
of 21,080 square feet (1,958 m²) with both line-of-sight
and non-line-of-sight links.


Figure 5: Locations of nodes (indicated by blue circles) in our
experimental testbed, which operates as part of our campus network.
[Figure annotation: 170 feet]

Figure 6: CDF of TEP slot scheduling errors. The figure shows
that the maximum scheduling error is 1.8 µs, which is significantly
lower than the slot duration of 40 µs.


7.2.1 Transmitter 


The performance of TEP hinges on the transmitter accu- 
rately scheduling the transmission of the ON-OFF slots. 
The difficulty in accurate scheduling arises from the fact 
that we want to implement the protocol in software using 
standard 802.11 chipsets. Hence, we are limited by the 
operating system and the hardware interface. For example,
if the kernel or the hardware introduces extra delays
between the slot packets, it will alter the bit sequence
conveyed to the receiver and cause failures. Given that
our slot is 40 µs, we need an accuracy on the order
of a few microseconds. Can we achieve such accuracy
with existing kernels and chipsets?

Experiment. We focus on the most challenging ON-OFF
slot sequence from a scheduling perspective: alternating
zeros and ones, which requires the maximum
scheduling precision. We set the slot time to 40 µs by
sending a packet at the highest bitrate of 54 Mbps. To
measure the produced slots accurately, we capture the
signal transmitted by our 802.11 sender using a USRP2
software radio board [9]. Our USRP2 board can measure
signal samples at a resolution of 0.16 µs, allowing
us to accurately compute the duration of the produced
slots. We run the experiment 1000 times for each sender
in our testbed and measure the exact duration of every
slot. We then compute the scheduling error as the difference
between the measured slot duration and the intended
40 µs.

Figure 7: CDFs of the fractional occupancy during ON slots and
OFF slots. The figure shows that the two distributions have no
overlap and hence the receiver cannot confuse ON and OFF slots.


Results. Fig. 6 shows the CDF of slot scheduling errors.
The figure shows that the median scheduling error is less
than 0.4 µs and the maximum error is 1.8 µs. Thus,
despite operating in software and with existing chipsets,
a TEP sender can accurately schedule the ON-OFF slots
at microsecond granularity.
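
The error computation in this experiment can be sketched as follows. This is an illustration under stated assumptions, not the measurement code: slot lengths are given as counts of 0.16 µs samples, the error is taken as an absolute difference, the function name is hypothetical, and the sample counts in the usage lines are made-up inputs rather than measured data.

```python
from statistics import median

SAMPLE_NS = 160     # sample spacing from the text: 0.16 us, in nanoseconds
SLOT_NS = 40_000    # intended slot duration from the text: 40 us, in nanoseconds


def scheduling_errors_ns(slot_lengths_in_samples):
    """Absolute difference between each measured slot duration and 40 us."""
    return [abs(n * SAMPLE_NS - SLOT_NS) for n in slot_lengths_in_samples]


# Made-up slot lengths (in samples), used only to exercise the function.
errors = scheduling_errors_ns([250, 251, 249, 261])
median_error_ns = median(errors)
max_error_ns = max(errors)
```

Working in integer nanoseconds avoids floating-point rounding when comparing against the 40 µs target. Note that a single sample of measurement granularity already corresponds to 0.16 µs of error.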


7.2.2 Receiver 


TEP’s security depends on the receiver’s ability to distin- 
guish ON slots from OFF slots. In this section, we check 
that given that the receiver is within the sender’s radio 
range (i.e., can sense the sender’s signal), it can clearly 
distinguish ON slots from OFF slots. 

Experiment. In each run, the sender sends a sequence 
of alternating ON-OFF slots, using a slot duration of
40 µs. The receiver uses a sensing window of 20 µs to
measure fractional occupancy. This means the receiver
has twice as many measurements of fractional occupancy
as there are slots. As explained in §6.2, the receiver keeps
either the odd or even measurements depending on which
sequence has higher variance. Hence, for each slot, the re- 
ceiver has exactly one fractional occupancy measurement. 
We then compare the measured fractional occupancy for 
known ON slots vs. known OFF slots to determine if the 
receiver can reliably distinguish between them based on 
measured fractional occupancy. We randomly pick two 
nodes in the testbed to be sender and receiver, and repeat 
the experiment for various node pairs in the testbed. 
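
The odd/even selection step can be illustrated with a short sketch. It assumes the occupancy measurements arrive as a flat list with two sensing windows per slot; the function name is hypothetical, and ties are broken in favor of the even-indexed windows as an arbitrary choice.

```python
from statistics import pvariance


def pick_windows(occupancy):
    """Keep the even- or odd-indexed measurements, whichever has higher variance."""
    even = occupancy[0::2]
    odd = occupancy[1::2]
    return even if pvariance(even) >= pvariance(odd) else odd


# Alternating ON-OFF slots whose boundaries fall in the middle of the
# even-indexed windows: those windows all read 0.5, while the odd-indexed
# windows lie fully inside a slot and read 1.0 or 0.0.
measurements = [0.5, 1.0, 0.5, 0.0, 0.5, 1.0, 0.5, 0.0]
per_slot = pick_windows(measurements)
```

In this example the even-indexed windows straddle slot boundaries and carry no information, so the higher-variance odd-indexed sequence is the one kept.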

Results. Fig. 7 plots the CDFs of fractional occupancy 
for ON slots and OFF slots. The figure shows that the two 
CDFs are completely separate; that is, there is no overlap 
in the values of fractional occupancy that correspond to 
OFF slots and those that correspond to ON slots. Hence, 
by looking at the fractional occupancy the receiver can 
perfectly distinguish the ON slots from OFF slots. This 
result shows that a TEP receiver based on current OSes 
and 802.11 hardware can accurately decode the ON-OFF 
slots necessary for the TEP protocol. 


7.3 Evaluating TEP’s Performance 


We are interested in how TEP interacts with cross traffic
in an operational network. Cross traffic does not hamper
TEP’s security (the proofs in §7.1 apply in the presence
of cross traffic). However, cross traffic may cause false
positives, where a node incorrectly declares that a TEP
message has been tampered with by an adversary. Such
events can unnecessarily delay secure pairing.

Figure 8: CDF of the duration of energy bursts in the SIGCOMM
2010 network and our campus network. The figure shows
that energy bursts caused by normal traffic are much shorter than
a TEP synchronization packet (19 ms). Thus, it is unlikely that
TEP will mistake normal traffic for a synchronization packet.

We investigate TEP’s interaction with cross traffic using 
results from two operational networks: the SIGCOMM 
2010 network, which is a heavily congested network, and 
our campus network, which is a moderately congested 
network. As in §7.2, our experiments use our modified 
Ath5k driver on AR5001X+ Atheros chipsets. In addition 
to cross-traffic on the TEP channel, both networks carried 
traffic on adjacent 802.11 channels. 


7.3.1 Impact of Cross Traffic on a Sync Packet 


In TEP, a receiver detects a TEA if the medium is contin- 
uously occupied for a period longer than the duration of a 
synchronization packet (19 ms). We would like to check 
that a receiver is unlikely to encounter false positives 
while detecting synchronization packets. False positives 
could occur in two scenarios: either (1) legitimate traffic 
includes such continuous long bursts of energy, or (2) 
a TEP receiver is incapable of detecting the short DIFS 
intervals that separate legitimate packets, and mistakes a 
sequence of back-to-back WiFi packets as a continuous 
burst of energy.⁵ We empirically study each case below.

⁵ A data packet and its ACK are separated by a SIFS, which is smaller
than a DIFS, but ACKs are short packets and the next data packet is
separated by a DIFS. Hence the maximum packing occurs with
back-to-back data packets without ACKs.

Experiment 1. We first check whether legitimate traffic
can cause the medium to be continuously occupied
for a duration of 19 ms. We use two production networks:
our campus network and the SIGCOMM 2010
network. Since we would like to capture all kinds of energy
bursts, including collisions, we sense the medium
using USRP2 radios. USRP2s allow us to directly look
at the signal samples and hence are much more sensitive
than 802.11 cards. We used a USRP2 board to eavesdrop
on the channel on which these networks operate and log
the raw signal samples. In order to compute the length of
bursts on the channel, we need to be able to identify the
beginning of a burst and its end in an automated way. To
do so, we use the double sliding window packet detection
algorithm⁶ typically used in hardware to detect packet
arrivals [23]. We collected over a million packets on the
SIGCOMM network and about the same number on our
campus network. We processed each trace to extract the
energy bursts and their durations (as explained above) and
plot the CDF of energy burst durations in Fig. 8.
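
The burst extraction step can be sketched with a simplified double sliding window detector. This is an illustration of the general technique, not the processing code used for the traces: the function name, the threshold, and the synthetic trace are assumptions, and the input is taken to be a list of per-sample energy values with a nonzero noise floor.

```python
def detect_bursts(energy, window, threshold=10.0, eps=1e-12):
    """Return (start, end) sample indices of energy bursts.

    For each index k, compare the energy in the window after k with the energy
    in the window before k. A peak in the ratio marks the start of a burst and
    a dip marks its end. A burst still open at the end of the trace is dropped.
    """
    prefix = [0.0]
    for e in energy:
        prefix.append(prefix[-1] + e)

    bursts = []
    start = None                  # start index of the currently open burst
    best_k, best_r = None, None   # extreme ratio within the current candidate region
    for k in range(window, len(energy) - window + 1):
        before = prefix[k] - prefix[k - window] + eps
        after = prefix[k + window] - prefix[k] + eps
        ratio = after / before
        if start is None:
            if ratio > threshold:
                if best_r is None or ratio > best_r:
                    best_k, best_r = k, ratio
            elif best_k is not None:
                start, best_k, best_r = best_k, None, None
        else:
            if ratio < 1.0 / threshold:
                if best_r is None or ratio < best_r:
                    best_k, best_r = k, ratio
            elif best_k is not None:
                bursts.append((start, best_k))
                start, best_k, best_r = None, None, None
    return bursts


# Synthetic trace: noise, a 30-sample burst 100x above the noise floor, noise.
trace = [1.0] * 50 + [100.0] * 30 + [1.0] * 50
bursts = detect_bursts(trace, window=10)
```

Multiplying each burst's length in samples by the sample spacing gives the burst durations that feed a CDF such as the one in Fig. 8.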

Result 1. The results in Fig. 8 show that all energy 
bursts in both networks lasted for less than 4.3 ms, which 
is much shorter than a TEP synchronization packet. In par- 
ticular, the majority of energy bursts last between 0.25 ms 
and 2 ms. This corresponds to a packet size of 1500 bytes 
transmitted at a bit rate between 6 Mb/s and 48 Mb/s, 
which spans the range of 802.11g bit rates. A few bursts
are shorter and are likely short ACK packets. A few bursts
also last longer than 2 ms; such longer bursts are typically
due to collisions. Fig. 9 illustrates this case, where the
second packet starts just before the first packet ends,
causing a spike in the energy level on the channel. Soon
after, the first packet ends, causing the energy to drop
again, but the two transmissions have already collided.⁷
Interestingly, the bit rates used in our
campus network are lower than those used at SIGCOMM. 
This is likely because at SIGCOMM, the access point was 
in the conference room and in line-of-sight of senders and 
receivers, while in our campus, an access point serves 
multiple offices that span a significant area and are rarely 
in line-of-sight of the access point. 

Overall, the results in Fig. 8 indicate that bursts of 
energy in today’s production networks have significantly 
shorter durations than TEP’s synchronization packet, and 
hence are unlikely to cause false positives. 

⁶ The double sliding window algorithm compares the energy in two
consecutive sliding windows. If there is no packet, i.e., the two windows
are both capturing noise, the ratio of their energy is around one.
Similarly, if both windows are already in the middle of a packet, their relative
energy is one. In contrast, when one window is partially sliding into a
packet while the other is still capturing noise, the ratio between their
energy starts increasing. The ratio spikes when one window is fully
into a packet while the other is still fully in the noise, which indicates
that the beginning of the packet is at the boundary between the two
windows. Analogously, a steep dip in energy corresponds to the end of
a packet [23].

⁷ Collisions of two 1500-byte packets transmitted at 6 Mb/s may be
slightly longer than 4 ms because of the additional symbols corresponding
to link layer header and trailer, and the PHY layer preamble.

Figure 9: The energy pattern of the maximum energy burst in
the SIGCOMM trace. The figure indicates that such relatively long
bursts are due to collisions at the lowest bit rate of 6 Mb/s. The
other spikes correspond to packets sent at higher bit rates.

Figure 10: CDF of fractional occupancy measured by a receiver
for transmissions of either a synchronization packet or a sequence
of back-to-back 1500-byte packets separated by DIFS. The figure
shows a full separation between the two CDFs, indicating that a
TEP receiver does not mistake back-to-back packets for a
synchronization packet.

Figure 11: Energy pattern for TEA slots in the presence of a
Bluetooth device causing interference.

Experiment 2. The second scenario in which a node
may incorrectly detect a synchronization packet occurs
when the node confuses a sequence of back-to-back packets
separated by DIFS as a single continuous energy burst.
Thus, we evaluate our prototype’s ability to distinguish
a synchronization packet from a stream of back-to-back
802.11 packets. To do so, we randomly pick two
nodes in our testbed in Fig. 5, and make one node transmit
a stream of back-to-back 1500-byte packets at the
lowest rate of 1 Mbps, while the other node senses the
medium using the default sensing window of 2 ms. We
then make the same sender transmit a stream of synchronization
packets while the receiver senses these packets
using a 2 ms window. For both cases, we compute the
fractional occupancy in each sensing window. We repeat
the experiment with multiple node pairs and compare the
fractional occupancy during back-to-back packets and
synchronization packets.


Result 2. Fig. 10 compares the CDF of the fractional
occupancy during a synchronization packet and the CDF
of the fractional occupancy when the sensing window includes
back-to-back packets separated by a DIFS,⁸ taken
over 100K synchronization packets and 100K DIFS
occurrences. The figure shows that the two CDFs are
sufficiently separate, making it unlikely that TEP mistakes
back-to-back packets for a synchronization packet.
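
The distinction drawn in this experiment can be sketched as a run-length test on per-window fractional occupancy. This is a minimal sketch, not the prototype's detector: the function names are hypothetical, the busy threshold of 0.99 is an assumed value chosen only for illustration, and the occupancy lists are synthetic.

```python
def longest_busy_run_ms(occupancy, window_ms, busy_threshold):
    """Duration of the longest run of consecutive windows at or above the threshold."""
    longest = current = 0
    for frac in occupancy:
        current = current + 1 if frac >= busy_threshold else 0
        longest = max(longest, current)
    return longest * window_ms


def looks_like_sync(occupancy, window_ms=2.0, busy_threshold=0.99, sync_ms=19.0):
    """Declare a synchronization packet if the medium stays busy for at least sync_ms."""
    return longest_busy_run_ms(occupancy, window_ms, busy_threshold) >= sync_ms


# Synthetic inputs: ten fully occupied 2 ms windows versus a stream in which
# every other window contains an idle gap between packets.
sync_like = [1.0] * 10
back_to_back = [1.0, 0.97] * 5
```

With these inputs, `looks_like_sync(sync_like)` is true and `looks_like_sync(back_to_back)` is false, because a single window with lower occupancy is enough to end the energy burst.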


⁸ Sometimes the DIFS may be split between two consecutive sensing
windows; in this case we include in the CDF whichever of these two
windows has the lower fractional occupancy. This is because it is sufficient
that one sensing window shows a relatively low fractional occupancy to
declare the end of an energy burst.

Figure 12: Number of attempts required for TEP to successfully
pair in the presence of an interfering Bluetooth device.


7.4 Performance with Non-802.11 Traffic 


Finally, while 802.11 nodes comply with the rules of CTS- 
to-SELF, and abstain from transmitting during TEA’s ON- 
OFF slots, other devices may continue to transmit, caus- 
ing TEA nodes to detect tampering. Fig. 11 shows a 
collision between a TEA and a Bluetooth transmission 
from an Android phone as captured by a USRP2. Blue- 
tooth devices do not typically decode 802.11 CTS-to- 
SELF packets, and hence, as shown in the figure, end up 
transmitting during the ON-OFF slots. In this section we 
examine the impact of a nearby Bluetooth device on TEA. 

Experiment. We place a TEA sender in location 1
(Fig. 5) and make other nodes act as TEA receivers. We 
co-locate a Bluetooth device next to the TEA sender. The 
sender periodically sends an announcement. The receivers 
first detect the synchronization packets, decode the CTS- 
to-SELF, and then try to verify the slots. If the receiver 
can successfully verify, it declares success. Otherwise, it 
attempts to verify the slots in the next time period. 

Results. Fig. 12 shows the CDF of the number of
required attempts before a TEA receiver succeeds in re- 
ceiving a correct TEA. Bluetooth transceivers operate on 
79 bands in 2402-2480 MHz and frequently jump across 
these bands. Thus, the probability that they interfere with 
TEA in successive runs of the protocol is relatively low. 
The figure shows that, even in the presence of Bluetooth 
devices which cannot decode a CTS-to-SELF, a TEA re- 
ceiver requires 1.4 attempts on average, and 4 attempts 
maximum, before it receives the announcement. 


8 CONCLUSION 


This paper presented Tamper-Evident Pairing (TEP), the 
first wireless pairing protocol that works in-band, with 
no pre-shared keys, and protects against MITM attacks. 
TEP relies on a Tamper-Evident Announcement (TEA) 
mechanism, which guarantees that an adversary cannot 
tamper with either the payload in a transmitted message, 
or with the fact that the message was sent. We formally 
proved that the design protects from MITM attacks. Fur- 
ther, we implemented a prototype of TEA and TEP for the 
802.11 wireless protocol using off-the-shelf WiFi devices, 
and showed that TEP is practical on real-world 802.11 
networks and devices. 




ACKNOWLEDGMENTS 


We thank Ramesh Chandra, James Cowling, Haitham Hassaneih, 
Nate Kushman, Jad Naous, Benjamin Ransford, and our shep- 
herd Diana Smetters for their insightful comments. We also 
thank Jukka Suomela and Piotr Indyk for help with the efficient 
bit-balancing algorithm in the Appendix. This work is funded 
by NSF and SMART-FM. 


REFERENCES 


[1] Atheros linux wireless driver.
http://wireless.kernel.org/en/users/Drivers/ath5k.

[2] D. Balfanz, G. Durfee, D. K. Smetters, and R. Grinter. In
search of usable security: five lessons from the field. IEEE
Journal on Security and Privacy, 2(5):19-24,
September-October 2004.

[3] S. M. Bellovin and M. Merritt. Encrypted key exchange:
Password-based protocols secure against dictionary at- 
tacks. In Proceedings of the 13th IEEE Symposium on 
Security and Privacy, Oakland, CA, May 1992. 

[4] V. Boyko, P. MacKenzie, and S. Patel. Provably secure 
password-authenticated key exchange using diffie-hellman. 
In B. Preneel, editor, Advances in Cryptology—Eurocrypt 
2000, volume 1807 of Lecture Notes in Computer Science, 
pages 156-171. Springer-Verlag, 2000. 

[5] M. Cagalj, J.-P. Hubaux, S. Capkun, R. Rengaswamy,
I. Tsigkogiannis, and M. Srivastava. Integrity codes: Message
integrity protection and authentication over insecure
channels. In Proceedings of the 27th IEEE Symposium on
Security and Privacy, pages 280-294, Oakland, CA, May
2006.

[6] S. Capkun, M. Cagalj, R. Rengaswamy, I. Tsigkogiannis,
J.-P. Hubaux, and M. Srivastava. Integrity codes: Message
integrity protection and authentication over insecure
channels. IEEE Transactions on Dependable and Secure
Computing, 5(4):208-223, October-December 2008.

[7] W. Diffie and M. E. Hellman. New directions in cryptography.
IEEE Transactions on Information Theory, 22(6):644-654,
November 1976.

[8] W. Diffie, P. C. van Oorschot, and M. J. Wiener. Authentication
and authenticated key exchanges. Designs, Codes,
and Cryptography, 2(2):107-125, 1992.

[9] Ettus Inc. Universal software radio peripheral.
http://ettus.com.

[10] M. T. Goodrich, M. Sirivianos, J. Solis, G. Tsudik, and
E. Uzun. Loud and clear: human-verifiable authentication
based on audio. In Proceedings of the 26th International
Conference on Distributed Computing Systems, Lisboa,
Portugal, July 2006.

[11] J. D. Halamka. Telemonitoring for the home.
http://geekdoctor.blogspot.com/2010/04/telemonitoring-for-home.html,
April 2010.

[12] IEEE. 802.15.1 specification: Personal area networks, 
2002. 


[13] IEEE. 802.11i specification: Amendment 6: MAC security
enhancements, 2004. 

[14] Kelton Research. Survey: Protecting wireless
network an essential element of home security.
http://www.wi-fi.org/news_articles.php?f=media_news&news_id=1,
November 2006.

[15] C. Kuo, J. Walker, and A. Perrig. Low-cost manufac- 
turing, usability and security: An analysis of bluetooth 
simple pairing and wi-fi protected setup. In Proceedings 
of the Usable Security Workshop, Lowlands, Scarborough, 
Trinidad/Tobago, February 2007. 

[16] R. Li. WiFi hitting the security camera scene. eZine Articles,
March 2010. http://ezinearticles.com/?id=3963601.

[17] R. Mayrhofer and H. Gellersen. Shake well before use:
Authentication based on accelerometer data. In Proceedings
of the 5th International Conference on Pervasive Computing,
Toronto, Canada, May 2007.

[18] J. M. McCune, A. Perrig, and M. K. Reiter. Seeing-is- 
believing: using camera phones for human-verifiable au- 
thentication. In Proceedings of the 26th IEEE Symposium 
on Security and Privacy, Oakland, CA, May 2005. 

[19] D. A. Norman. The way I see it: When security gets in the
way. Interactions, 16(6), November-December 2009.

[20] V. Roth, W. Polak, E. Rieffel, and T. Turner. Simple and
effective defense against evil twin access points. In Proceedings
of the 1st ACM Conference on Wireless Network
Security, Alexandria, VA, March-April 2008.

[21] SensorMetrics, Inc. Intellisense WiFi products: Temperature
sensors, motion sensors, power sensors.
http://www.sensormetrics.com/wifi.html.

[22] F. Stajano and R. Anderson. The Resurrecting Duckling: 
Security Issues for Ad-hoc Wireless Networks. In Pro- 
ceedings of the 7th International Workshop on Security 
Protocols, 1999. 

[23] J. K. Tan. An Adaptive Orthogonal Frequency Division 
Multiplexing Baseband Modem for Wideband Wireless 
Channels. Master’s thesis, MIT, 2006. 

[24] C. Ware, J. Judge, J. Chicharo, and E. Dutkiewicz. Un- 
fairness and capture behavior in 802.11 adhoc networks. 
In Proceedings of the IEEE International Conference on 
Communications, 2000. 

[25] WiFi Alliance. WPS Certified Products.
http://www.wi-fi.org/search_products.php.

[26] WiFi Alliance. WiFi protected setup specification, version
1.0h, 2006.

[27] WiFi Alliance. WiFi Alliance to ease setup of home
WiFi networks with new industry wide program.
http://www.wi-fi.org/news_articles.php?f=media_news&news_id=263,
January 2007.


A BIT-BALANCING ALGORITHM 


TEA’s bit-balancing algorithm takes an even number, N, of input 
bits and produces M = N + 2[logN] output bits which have an 
equal number of zeros and ones. If the input sequence has an 
odd number of bits, we pad a | bit to it to make it an even length 
sequence. 

Input Sequence: 1000, $D_0 = -2$
i=1: 1000 → 0000, $D_1 = -4$
i=2: 0000 → 0100, $D_2 = -2$
i=3: 0100 → 0110, $D_3 = 0$
Output Sequence: 01101001

Table 2: Example run of our 0-1 balanced function


Let the input bit sequence of our algorithm be denoted by
IN, and the output bit-balanced sequence be denoted by OUT.
We define $D_0$ to be the difference between the number of ones
and zeros in the input IN. Also, $D_i$ is defined as the difference
between the number of ones and zeros after flipping the first $i$
bits in the input IN. Our algorithm works as follows.


• Step 1: Compute the difference $D_0$ between the number of
ones and number of zeros in IN. Set $i$ to 1 and $S_0$ to IN.

• Step 2: Flip the $i$-th bit in $S_{i-1}$ to get $S_i$. Then compute the
new difference $D_i$ as $D_i = D_{i-1} \pm 2$, depending on whether
the $i$-th bit is one or zero ($-2$ if the bit was one before the
flip, $+2$ if it was zero).

• Step 3: If $D_i = 0$, then set INDEX to $i$ and $OUT_{temp}$ to $S_i$ and
go to Step 4. Otherwise increment $i$ and go to Step 2.

• Step 4: Set the output OUT to be the concatenation of
$OUT_{temp}$ and the Manchester encoding of the bit representation
of INDEX − 1. Since $S_{INDEX}$ is $N$ bits long and the
Manchester encoding of INDEX − 1 is $2\lceil \log N \rceil$ bits long,
the output OUT is $N + 2\lceil \log N \rceil$ bits long.





To see how the above algorithm works, let us take the 4 bit
input sequence, 1000, shown in Table 2. The difference $D_0$ for
this sequence is $-2$. In the first iteration, we flip the first bit to
get the bit sequence 0000 which has a difference $D_1 = -4$. In
the second iteration, we flip the second bit to get 0100 which has
a difference $D_2 = -2$. Finally, in the third iteration, we flip the
third bit to get 0110 which has a difference $D_3 = 0$. Thus, we
output this sequence concatenated with the Manchester encoding
of 3 − 1, which is 1001. Thus, the bit balanced output sequence
is 01101001.

The above algorithm relies on the fact that there exists an
INDEX bit position for which $D_{INDEX} = 0$. Such an INDEX
always exists for the following reason. First, because the sequence
$S_0$ has an even number of bits, $D_0$ is even. Further, for every bit
flipped, $D_i$ differs from $D_{i-1}$ by exactly $\pm 2$. Finally, since $S_N$ is
the bitwise opposite of $S_0$ and thus $D_N = -D_0$, there must exist
an INDEX for which $D_{INDEX} = 0$.

Note that this is a one-to-one mapping and the decoding
can be done in linear time. Specifically, the decoder takes the
last $2\lceil \log N \rceil$ bits and constructs INDEX from its Manchester
encoding. Then it takes the first $N$ bits and flips the first INDEX
bits in the first $N$ bits to get the original bit sequence.
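
The encoder and decoder described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names balance, unbalance, and manchester are made up for this sketch, bit sequences are represented as strings of '0' and '1', and the Manchester convention (1 → 10, 0 → 01) is inferred from the example in Table 2.

```python
def manchester(bits):
    """Manchester-encode a bit string (assumed convention: 1 -> 10, 0 -> 01)."""
    return "".join("10" if b == "1" else "01" for b in bits)


def balance(bits):
    """Return a bit string with equally many zeros and ones (Steps 1-4)."""
    if not bits:
        raise ValueError("input must contain at least one bit")
    if len(bits) % 2:                      # odd length: pad a 1 bit
        bits += "1"
    n = len(bits)
    width = (n - 1).bit_length()           # equals ceil(log2(n)) for n >= 2
    s = list(bits)
    d = bits.count("1") - bits.count("0")  # D_0
    for i in range(1, n + 1):              # i is the 1-based bit position
        if s[i - 1] == "1":
            s[i - 1] = "0"
            d -= 2
        else:
            s[i - 1] = "1"
            d += 2
        if d == 0:                         # D_i = 0, so INDEX = i
            index_bits = format(i - 1, "0{}b".format(width))
            return "".join(s) + manchester(index_bits)
    raise AssertionError("unreachable: D_N = -D_0 guarantees a zero crossing")


def unbalance(code, n):
    """Invert balance() for an n-bit (already padded) input sequence."""
    body, tail = code[:n], code[n:]
    index = int(tail[0::2], 2) + 1         # first bit of each pair is the data bit
    flipped = "".join("1" if b == "0" else "0" for b in body[:index])
    return flipped + body[index:]
```

Applied to the input 1000 from Table 2, balance returns 01101001, and unbalance with N = 4 recovers 1000. Note that the decoder in this sketch returns the padded sequence; removing a padding bit would require the original length to be known by other means.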





B REDUCING MEDIUM OCCUPANCY 


TEP’s specifications in §5.2 ensure that if there is any possibility 
that a registrar missed a TEA request (i.e., if TEA_RECV_GET 
returned RETRY or OVERLAP), that registrar will immediately 
transmit a TEA reply, without regard for carrier-sense. Thus, if, 
by some chance, multiple registrars transmit overlapping replies 
at almost the same time, each of them will then assume it may 
have missed a request from some enrollee (since it sensed a 
concurrent TEA message), and each will re-send its reply. This 
cycle of replies may continue until each registrar’s PBC walk 
time (120 sec) expires. This section shows how to modify the 
basic TEP protocol to avoid occupying the medium for 120 
seconds in this situation.

To address this issue, we make two changes to the TEP protocol
from §5.2. First, the registrar does not re-transmit replies
if all of the possibly-missed TEA requests overlapped with its
previous transmission. In other words, the registrar performs
TEA_SEND only if m ≠ OVERLAP. Not re-sending the reply
is safe only if enrollees whose requests may not get a reply
also learn of the TEA overlap (and thus return a session overlap
error). To guarantee this, we make a second change, to the
enrollee, so that it listens for tea_duration both before and after
transmitting its request. This ensures that an enrollee hears
any TEA replies (from registrars) that overlap its own TEA
request. (As before, if an enrollee detects a TEA message it
cannot decode, it triggers a session overlap error.) Thus, the
enrollee pseudo-code is augmented as follows (changing the loop
duration and introducing an additional SLEEP before sending):

    r ← ∅
    for 120 sec + #channels × (tx_tmo + 3 × tea_duration) do
            ▷ walk time + max enrollee scan period
        switch to next 802.11 channel
        h ← TEA_RECV_START(reply)
        SLEEP(tea_duration)
        TEA_SEND(request, enroll_info, now + tx_tmo)
        SLEEP(tea_duration)
        r ← r ∪ TEA_RECV_GET(h)
    end for

The registrar must also wait for the same increased loop time 
to accommodate the modified enrollee. With these changes, 
TEP safely avoids occupying the medium for the whole walk 
time in cases when multiple registrars hear each other’s replies. 
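
To make the ordering of the modified enrollee loop concrete, the following Python sketch replays it against a simulated clock. Everything here is hypothetical scaffolding for illustration: FakeClock, LoggingRadio, and enrollee_loop are invented names, the radio methods merely stand in for the TEA primitives of the pseudo-code, and no real 802.11 interface is involved.

```python
class FakeClock:
    """Simulated clock so the sketch runs instantly (hypothetical helper)."""

    def __init__(self):
        self.t = 0.0

    def now(self):
        return self.t

    def sleep(self, seconds):
        self.t += seconds


class LoggingRadio:
    """Stand-in for the TEA primitives; it only records the call order."""

    def __init__(self, clock, send_time):
        self.clock = clock
        self.send_time = send_time   # assumed airtime of one TEA request
        self.log = []

    def switch_channel(self, channel):
        self.log.append(("switch", channel))

    def tea_recv_start(self, kind):
        self.log.append(("recv_start", kind))
        return len(self.log)         # opaque handle

    def tea_send(self, kind, info, deadline):
        self.clock.sleep(self.send_time)
        self.log.append(("send", kind))

    def tea_recv_get(self, handle):
        self.log.append(("recv_get", handle))
        return set()                 # no replies in this simulation


def enrollee_loop(n_channels, tx_tmo, tea_duration, clock, radio,
                  enroll_info=None, walk_time=120.0):
    """Modified enrollee loop: listen before and after sending the request."""
    replies = set()
    # walk time + max enrollee scan period
    deadline = clock.now() + walk_time + n_channels * (tx_tmo + 3 * tea_duration)
    channel = 0
    while clock.now() < deadline:
        radio.switch_channel(channel)
        handle = radio.tea_recv_start("reply")
        clock.sleep(tea_duration)    # additional SLEEP before sending
        radio.tea_send("request", enroll_info, clock.now() + tx_tmo)
        clock.sleep(tea_duration)    # keep listening after sending
        replies |= radio.tea_recv_get(handle)
        channel = (channel + 1) % n_channels
    return replies
```

Because the receive handle is opened before the first SLEEP and read only after the second, any reply that overlaps the enrollee's own request falls inside the listening window, which is the property the security argument depends on.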


B.1 Extending the Security Proof


Next, we prove that the above optimization is secure. 


Proposition B.1 An enrollee and registrar following the opti- 
mized TEP protocol (from Appendix B) cannot be tricked into 
accepting an incorrect public key, as in Prop. 7.4. 


Proof The only change in the optimized protocol that affects 
the proof for Prop. 7.4 is that the registrar does not resend its 
reply when m = OVERLAP. The registrar still computes the 
same set e of enrollee messages as in Prop. 7.4, and therefore 
cannot be tricked into accepting an incorrect public key. 

We prove that the enrollee also cannot be tricked, by contra- 
diction. Suppose that the enrollee is tricked into accepting the 
wrong key. From the proof of Prop. 7.4, this must be because 
the registrar did not respond to the enrollee’s request. This must 
be because the registrar’s TEA_RECV_GET returned OVERLAP, 
i.e., the registrar missed zero or more requests, all of which 
overlapped the registrar’s TEA_SEND. Thus, the enrollee must 
have transmitted its request within tea_duration of the registrar 
transmitting a reply. 

By the enrollee pseudo-code in Appendix B, the enrollee was 
listening for TEA messages for tea_duration before and after 
sending its request. If the enrollee’s TEA_RECV_GET returned 
the registrar’s reply, the enrollee could not have accepted a different
key (by Prop. 7.4). Thus, the enrollee’s TEA_RECV_GET
must have returned RETRY or OVERLAP. But in both of these
cases, the enrollee would not have accepted any key. Thus, by 
contradiction, the enrollee cannot be tricked, and the optimized 
TEP protocol is secure.


Abstract


Current disk encryption techniques store necessary keys 
in RAM and are therefore susceptible to attacks that tar- 
get volatile memory, such as Firewire and cold boot at- 
tacks. We present TRESOR, a Linux kernel patch that 
implements the AES encryption algorithm and its key 
management solely on the microprocessor. Instead of us- 
ing RAM, TRESOR ensures that all encryption states as 
well as the secret key and any part of it are only stored 
in processor registers throughout the operational time of 
the system, thereby substantially increasing its security. 
Our solution takes advantage of Intel’s new AES-NI in- 
struction set and exploits the x86 debug registers in a 
non-standard way, namely as cryptographic key storage. 
TRESOR is compatible with all modern Linux distribu- 
tions, and its performance is on a par with that of stan- 
dard AES implementations. 


1 Introduction 


Disk encryption is an increasingly used method to protect 
confidential information in computer systems. It is par- 
ticularly effective for mobile systems, such as laptops, 
since these are frequently lost or stolen [29]. With the 
growing availability of disk encryption systems, crimi- 
nals and law enforcement alike have started to explore 
ways to circumvent this protection. Since current disk 
encryption techniques store keys in main memory, one 
approach to access the encrypted data is to acquire the 
key physically. 

When physical access to the machine is given, keys 
can be extracted from main memory of running and sus- 
pended machines without privileged user access. Such 
attacks can broadly be classified into DMA attacks and 
cold boot attacks. DMA attacks use direct memory ac- 
cess through ports like Firewire [4, 3, 5], PCI [6, 28, 13], 
or PC Card [8, 17] to access RAM, while cold boot attacks
[14] exploit the fact that memory contents fade
away gradually over time. This makes it possible to restore RAM
contents after a short power-down by rebooting the machine
with a boot device that directly reads out memory.
Widespread disk encryption systems like BitLocker [1] 
(Windows), FileVault (MacOS), dm-crypt [30] (Linux), 
and TrueCrypt [36] (multi-platform) do not protect 
against such attacks. The current technological response, 
namely to keep the key in RAM but obfuscate its pres- 
ence by using dispersal techniques, only partly counters 
the threat of memory attacks. 


1.1 TRESOR 


In this paper, we present the design and implementa- 
tion of TRESOR (pronounced [tre:zoa]), a Linux kernel 
patch for the x86 architecture that implements AES in a 
way that is resistant to the attacks mentioned above and
hence allows for disk drive encryption with improved
security.

TRESOR runs encryption securely outside RAM. Its 
underlying idea is to avoid RAM usage completely by 
both storing the secret key in CPU registers and running 
the AES algorithm entirely on the microprocessor. To- 
wards this goal, TRESOR (mis)uses the debug registers 
as secure cryptographic key storage. While the princi- 
ple of TRESOR is basically applicable to most x86 com- 
patible CPUs, we focus on an implementation exploiting 
Intel’s new AES-NI [31] extensions. The new AES instructions,
currently available on all Core i7 processors
and most Core i5, allow for accelerated AES using short
and efficient code which implements most of the cryptographic
primitive in hardware.

On systems running TRESOR, setting hardware 
breakpoints is no longer possible because the breakpoint 
registers are occupied with key data. However, as there 
are only four breakpoint registers, debuggers like GDB 
must deal with the possibility that all of them are busy 
anyway, for example, when more than four are set in parallel.


1.2 Related Work


TRESOR is the successor of AESSE [24], which was
our prototype implementation but was not practical
because it suffered from two major problems. First,
the registers of the Streaming SIMD Extensions (SSE) [34]
were used as key storage, breaking binary compatibility with many
multimedia, math, and 3D applications. Second, AESSE
was a pure software implementation and, due to the
shortage of space inside CPU registers, the algorithm
ran about six times slower than comparable
standard implementations of AES.

During the work on TRESOR, Simmons independently
developed a system called Loop-Amnesia [27]
which pursues the same idea of holding the cryptographic
key solely in CPU registers. In contrast to
TRESOR, Loop-Amnesia stores the key inside machine
specific registers (MSRs) rather than in debug registers.
Currently, it does not support the AES-NI instruction set
and supports only a 128-bit version of AES. However, it allows
multiple disk encryption keys to be stored securely inside RAM
by scrambling them with a master key.

With BitArmor [23] there exists a commercial solution
that claims to be resistant against cold boot attacks in
particular. But as BitArmor does not generally avoid storing
the secret key in RAM, it cannot protect from other
attacks against main memory. Consequently, its cold boot
resistance is not perfect either (though good enough to resist
the most common attacks of this kind).

Additionally, Pabel proposed a solution called Frozen 
Cache [21, 22] that exploits CPU caches rather than reg- 
isters as secure key storage outside RAM. To our knowl- 
edge, this project is currently work in progress at an early 
development stage. Although it is a nice idea, a secure 
and efficient implementation is very difficult, if possible 
at all, because x86 caches can hardly be controlled by the 
system programmer. 

Last, hardware solutions like Full Disk Encryption
Hard Drives (HDD-FDE) [16, 35] use specialized crypto
chips for encryption instead of the system CPU and
RAM. Indeed, this is an effective method to defeat memory
attacks, but it does not compete with TRESOR as a
software solution. In our opinion, software solutions are
not obsolete as they have several advantages: they are
cheaper, highly configurable, vendor independent, and,
last but not least, quickly deployable on many existing
machines.


1.3 Contributions


The central innovations of TRESOR are storing the secret
key in CPU registers and utilizing AES-NI for encryption.
AES-NI offers encryption (and decryption)
primitives directly on the processor. This, however,
does not mean that AES-NI based implementations of
AES withstand memory attacks out of the box. A typical
AES-NI based implementation uses RAM to store
the secret key and, for reasons of performance, the key
schedule. The AES key schedule is required by the individual
AES rounds and is generally computed only once
and then stored inside RAM to speed up the encryption
process. In contrast, TRESOR implements AES using
AES-NI without leaking any key-related data to RAM.
The contributions of this paper are:


• We implement AES without storing any sensitive
information in RAM.


• To this end, we present a kernel patch (TRESOR)
that is binary compatible with all Linux distributions.


• We show that by using Intel’s new AES-NI instructions,
TRESOR is as fast as (even slightly faster than)
standard AES implementations that use RAM.


• By running TRESOR in a virtual machine and constantly
monitoring its main memory, we demonstrate
that TRESOR can withstand considerable efforts
to compromise the encryption key. The only
method to access the key with reasonable effort is
compromising the system space, using a loadable
kernel module, for example. Many other attacks,
such as hardware attacks targeting processor registers,
are defeated by TRESOR.


Overall, TRESOR is a disk encryption system that is 
both secure against main memory attacks and well appli- 
cable in practice. 


1.4 Outline 


The rest of this paper is structured as follows: In Sec- 
tion 2 we explain our design choices and give imple- 
mentation details. We have evaluated TRESOR regard- 
ing three aspects: compatibility (Section 3), performance 
(Section 4) and, most importantly, its security (Sec- 
tion 5). We conclude in Section 6. 


2 Design and Implementation


We now give an overview of the design choices regarding
the interface and the implementation of TRESOR.


2.1 Security Policy 


The goal of TRESOR is to run AES entirely on the
microprocessor without using main memory. This implies
that neither the secret key, nor the key schedule, nor any
intermediate state should ever get into RAM. With this
restrictive policy, any attacks against main memory become
useless. But such an implementation cannot be
achieved simply in user space for two reasons:


• First of all, user space is affected by scheduling,
meaning that CPU registers are frequently swapped
out to RAM due to context switching. That is, the
key and/or intermediate states of AES would regularly
enter RAM, even though AES was implemented
to run solely on the microprocessor.


• Second, the key storage registers should not be accessible
from unprivileged tasks. Otherwise a local
attacker could easily read out and overwrite the key.


Both problems can only be solved by implementing 
TRESOR in kernel space. To suppress context switching, 
we run AES atomically. The atomic section is entered 
just before an input block is encrypted and left again 
right afterwards. Therefore, we can use arbitrary CPU 
registers to encrypt a block; we just have to reset them 
before leaving the atomic section. This guarantees that 
no sensitive data leaks into RAM by context switches. 
Between the encryption of two blocks, scheduling and 
context switches can take place as usual, so that the in- 
teractivity of multitasking environments is not affected. 

To prevent userland from reading out the secret key, it
is stored inside a CPU register set accessible only with
ring 0 privileges. Any attempt to read or write the debug
registers from other privilege levels generates a
general-protection exception [18]. This defeats attackers who
have gained local user privileges and try to read out the key
at the software level.

Because TRESOR must be implemented in system
space, we chose Linux for our solution because of
its open source kernel. But in general our approach is
portable to any x86 operating system.


2.2 Key management 


AES uses a symmetric secret key for encryption and de- 
cryption. We now show how this key is managed by 
TRESOR. 


Key storage 


The first question regarding key management is: In 
which registers is the key stored within the proces- 
sor? We now discuss several requirements these registers 
should meet. 

Since the key registers are exclusively reserved over
the entire uptime of the system, they will not be available
for their designated use. Hence, to preserve binary
compatibility with as many existing applications as
possible, only seldom-used registers qualify to act
as cryptographic key storage. Frequently used registers,
like the general purpose registers (GPRs), are not an option
since all computer programs need to read from and
write to those registers. The loss of registers occupied by
TRESOR should not break binary compatibility.

Another requirement is that the key registers should
not be accessible from user space as this would allow any
unprivileged process to read or write the secret key. A
key stored in GPRs, for example, could not be hidden
from userland as the GPRs are an unprivileged resource,
available to all processes.

Last but not least, the register set must be large enough
to hold AES keys, i.e., 128 bits for AES-128, 192 bits
for AES-192, and 256 bits for AES-256, respectively. A
single register is too small to hold an AES key on both
32- and 64-bit systems; thus, we have to use a set of
registers.

Summarizing, a register set must satisfy four require- 
ments to act as cryptographic key storage. The key reg- 
isters must be: 


1. seldom used by everyday applications, 

2. easy to compensate for in software,

3. a privileged resource, and 

4. large enough to store at least 128, better 256 bits. 


After considering all x86 registers we chose the de- 
bug registers, because they meet these requirements as 
we explain now. 

The debug register set comprises four breakpoint registers
dr0 to dr3, one status register dr6 and one control
register dr7. Depending on the operating mode,
dr4 and dr5 are reserved or just synonyms for dr6 and
dr7. Thus, the only registers which can be freely set to
any value are the four breakpoint registers dr0 to dr3.
On 32-bit systems these have 4 × 32 = 128 bits in total,
just enough to store the secret key of AES-128. But
on 64-bit systems these have 4 × 64 = 256 bits in total,
enough to store any of the defined AES key lengths 128,
192, and 256 bits.¹
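
The capacity argument can be checked with a short Python sketch that splits a key into four register-sized words. This is only an illustration of the arithmetic: the function name key_to_debug_words, the little-endian byte order, and the zero padding of unused bits are assumptions of this sketch, not details taken from TRESOR.

```python
def key_to_debug_words(key, word_bits=64):
    """Split an AES key into four register-sized integers (dr0..dr3)."""
    if word_bits not in (32, 64):
        raise ValueError("x86 debug registers are 32 or 64 bits wide")
    if len(key) not in (16, 24, 32):
        raise ValueError("AES keys are 128, 192, or 256 bits long")
    word_bytes = word_bits // 8
    capacity = 4 * word_bytes              # 16 bytes (32-bit) or 32 bytes (64-bit)
    if len(key) > capacity:
        raise ValueError("key does not fit into dr0..dr3")
    padded = key.ljust(capacity, b"\x00")  # assumption: unused bits stay zero
    return [int.from_bytes(padded[i:i + word_bytes], "little")
            for i in range(0, capacity, word_bytes)]
```

With 64-bit registers all three key lengths fit, whereas with 32-bit registers only a 128-bit key is accepted and longer keys raise an error.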

The actual purpose of breakpoint registers is to hold
hardware breakpoints and watchpoints, features which
are only used for debugging. And even for debugging
their functionality can be compensated quite well in software,
because software breakpoints can be used instead
of hardware breakpoints. TRESOR reserves all four
x86 breakpoint registers exclusively as key storage, i.e.,
TRESOR reduces the number of available breakpoint
registers for other applications from at most 4 to always
0. Since the breakpoint registers may be in use anyhow,
by debuggers for example, they can be unavailable
regularly as well, and thus, applications should
be able to tolerate their absence. This ensures binary
compatibility with almost all user space programs.


¹ Although the principle of TRESOR is applicable to 32-bit systems,
we recommend the usage of 64-bit CPUs to support full AES-256.
Intel’s Core-i processors are such 64-bit CPUs; these processors are also
recommended because of their AES-NI support.

Debug registers are a privileged resource of ring 0,
meaning that none of the user space applications running
in ring 3 can access debug registers directly. Any such
access is done via system calls, namely via ptrace.
As we show later, we patched the ptrace system call
to return -EBUSY whenever a breakpoint register is requested,
to let the user space know all of them are busy.





Key derivation 


The key we store in debug registers is derived from a user
password by computing a SHA-256 based message digest.
To resist brute force attacks, we strengthen the key
by applying 2000 iterations of the SHA-256 algorithm.
The password consists of 8 to 53 printable characters²
and is read from the user early during boot by an ASCII
prompt, directly in kernel space. Only in kernel space
do we have full control over side effects like scheduling and
context switching.

But how do we actually compute the key and get it into
debug registers without using RAM? The answer is that
we do use RAM for this transaction, but only for a very
short time frame during system startup. Although there
is a predefined implementation of SHA-256 in the kernel,
we implemented our own variant to ensure that all
memory lines holding sensitive information, like parts of
the key or password, are erased after usage. That is, during
boot, password and key do enter RAM very briefly.
But immediately afterwards, the key is copied into debug
registers and all memory traces of it are overwritten. All
this happens before any userland process comes to life.
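
The key-strengthening step can be sketched in Python as follows. This is an illustration under stated assumptions, not TRESOR's kernel code: the text does not specify how the 2000 iterations are chained, so the sketch assumes that each iteration hashes the previous digest, and the function name derive_key is invented here. Python also cannot guarantee that buffers are wiped, so the erasure of memory traces is not modelled.

```python
import hashlib


def derive_key(password, iterations=2000):
    """Derive a 256-bit key by iterating SHA-256 (assumed chaining scheme)."""
    if not 8 <= len(password) <= 53:
        raise ValueError("password must have 8 to 53 characters")
    if any(not 32 <= ord(c) <= 126 for c in password):
        raise ValueError("password must consist of printable ASCII characters")
    digest = password.encode("ascii")
    for _ in range(iterations):
        # Assumption: each round hashes the output of the previous round.
        digest = hashlib.sha256(digest).digest()
    return digest  # 32 bytes, enough for AES-128, AES-192, and AES-256
```

The range 32 to 126 covers the 95 printable ASCII characters, which is where the bound in the footnote on password length comes from: 95 to the power of 53 already exceeds 2 to the power of 256.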

Once the key has been entered and the machine is up
and running, it cannot be changed from user space during
runtime as it would be impossible to do so without
polluting RAM. The password only needs to be re-entered upon
ACPI wakeup, because during suspend mode, the CPU is
switched off and its context is copied into main memory.
Naturally, we bar the debug registers from being copied
into RAM, and hence, the key is lost during suspension
and the password must be re-entered. Again, this happens
early in the wakeup process, directly in kernel space
before any user mode process is unfrozen.

On 64-bit systems, like Intel’s Core-i series, we always
copy 256 key bits into the debug registers and each
of the AES variants (AES-128, AES-192, and AES-256)
takes as many bits from the key storage as it needs. On
multi-core processors, we copy the key bits into the debug
registers of all CPUs. Otherwise we would constantly have
to ensure that encryption runs on the single CPU which
holds the key. In terms of performance such migration
steps are very costly and it is more efficient to duplicate
the AES key onto all CPUs once. Furthermore, this allows
us to run several TRESOR tasks in parallel.


² More characters do not add to security as it becomes easier to
attack the key itself rather than the password because $95^{53} \gg 2^{256}$.


2.3 AES implementation 


The challenge we faced was implementing the AES algo- 
rithm without using main memory. This implies we were 
not allowed to store runtime variables on the stack, heap 
or anywhere else in the data segment. Naturally, our im- 
plementation was written in assembly language, because 
neither the usage of debug registers as key storage nor 
the avoidance of the data segment is supported by any 
high-level language compiler. 


Encryption algorithm 


Storing only the secret key in CPU registers would al- 
ready defeat common attacks on main memory, but fol- 
lowing our security policy mentioned above, absolutely 
no intermediate state of AES and its key schedule should 
get into RAM. This aims to thwart future attacks and 
cryptanalysis. In other words, after a plaintext block is 
read from RAM, we write nothing but the scrambled out- 
put block back. No valuable information about the AES 
key or state is visible in RAM at any time. 

From earlier experiments with AESSE [24], we were 
concerned about the performance penalty of encryption 
methods implemented without RAM. We therefore in- 
vestigated the utilization of the AES-NI instruction set 
of new Intel processors [31]. AES-NI allows for hard- 
ware accelerated implementations of AES by providing 
the instructions aesenc, aesenclast, aesdec, and 
aesdeclast. Each of them performs an entire AES 
round with a single instruction, exclusively on the pro- 
cessor without involving RAM. Hence, they are compat- 
ible with our design. 

Overall, utilizing AES-NI has several advantages: 


• The code is clear and short.
• It runs without RAM usage.
• It is highly efficient.


The four AES instructions mentioned above work on 
two operands, the AES state and an AES round key; the 
round key is used to scramble the state. Instead of using 
memory locations for these operands, the AES instruc- 
tions work on SSE registers. On 64-bit systems there 
are sixteen 128-bit SSE registers xmm0 to xmm15. AES 
states and AES round keys exactly fit into one SSE register
as they encompass 128 bits, too.


    pxor        %xmm0,  %xmm15
    aesenc      %xmm1,  %xmm15
    aesenc      %xmm2,  %xmm15
    aesenc      %xmm3,  %xmm15
    aesenc      %xmm4,  %xmm15
    aesenc      %xmm5,  %xmm15
    aesenc      %xmm6,  %xmm15
    aesenc      %xmm7,  %xmm15
    aesenc      %xmm8,  %xmm15
    aesenc      %xmm9,  %xmm15
    aesenclast  %xmm10, %xmm15


Figure 1: AES-128 encryption using AES-NI


Figure 1 shows assembly code of the AES-128 encryption
algorithm. The initial pxor adds the secret key to the
state, and each of the following ten lines performs one of
the ten encryption rounds. The second parameter (xmm15)
represents the AES state and the first parameter is the
secret key (xmm0) or one of the ten AES round keys
(xmm1 to xmm10). Writing a plaintext block into xmm15, the
secret key into xmm0, and the round keys into their
respective registers xmm1 to xmm10, these eleven lines of
assembly code suffice to generate an AES encrypted output
block in xmm15.

The decryption algorithm of AES-128 basically looks
the same, just utilizing aesdec instead of aesenc and
applying the round keys in reverse order. The implementations
of AES-192 and AES-256 basically look the
same, too, just performing 12 or 14 rounds instead of
ten.


Round key generation 


The difficulty of implementing AES completely within the
microprocessor stems from the structure of the AES
algorithm. As shown in Figure 1, encryption works with
round keys which we assumed to be stored in xmm1 to
xmm10. In conventional AES implementations, these
round keys are calculated once and then stored inside
RAM over the entire lifetime of the system. They are
copied from RAM into SSE registers only when needed,
to be used in combination with AES-NI.

In TRESOR we cannot calculate the AES key schedule
beforehand and store it inside RAM as this would
obviously violate our security policy. On the other hand,
we cannot store the entire key schedule in CPU registers
either, because debug registers are too small to hold it and
we do not want to occupy further registers for TRESOR.
Consequently, we have to use an on-the-fly key schedule
that recalculates the round keys each time the atomic
section is entered. This means the round keys must
be recalculated for each input block. Inside the atomic
section we can safely store the round keys inside SSE
registers as they are known not to be swapped out during
this period.

Fortunately, the AES-NI extensions comprise an instruction
for hardware accelerated round key generation,
namely aeskeygenassist. Obviously, recomputing
the entire key schedule again and again is a significant
performance drawback compared to standard implementations
of AES. By using this specialized instruction,
however, key generation is relatively efficient, as we show
later in this paper.


.macro key_schedule last next rcon
    pxor    %xmm14, %xmm14
    movdqu  \last, \next
    shufps  $0x1f, \next, %xmm14
    pxor    %xmm14, \next
    shufps  $0x8c, \next, %xmm14
    pxor    %xmm14, \next
    aeskeygenassist $\rcon, \last, %xmm14
    shufps  $0xff, %xmm14, %xmm14
    pxor    %xmm14, \next
.endm


Figure 2: AES-128 round key generation


Figure 2 lists assembly code to generate the next
round key of AES-128. As each round key computation
is based on slightly different parameters, we define
a macro called key_schedule awaiting these parameters:
last is an SSE register containing the previous
round key, next is one that is free to store
the next round key and rcon is an immediate byte,
the round constant. Inside this macro xmm14 is utilized
as a temporary helper register. To generate the
ten round keys of AES-128, key_schedule has to
be called ten times: key_schedule %xmm0 %xmm1
0x1, key_schedule %xmm1 %xmm2 0x2, and so
on. Initially the secret key has to be copied from debug
registers into xmm0.³

Using AES-NI it is more complex to generate round 
keys than to actually scramble or unscramble blocks, be- 
cause with aeskeygenassist Intel provides an in- 
struction to assist the programmer in key generation, but 
none to perform it autonomously. We conjecture that 
this is because key generation of the three AES variants 
AES-128, AES-192, and AES-256 differs slightly (for 
details see the original standard on AES [12]). 
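
For readers who want to follow the on-the-fly key schedule without AES-NI hardware, the following Python sketch models the AES-128 key expansion in plain software, as defined in the AES standard. It is not TRESOR's code and does not use aeskeygenassist: the names next_round_key and round_keys are invented for this sketch, and the S-box is computed from its mathematical definition rather than read from a table.

```python
def _gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def _build_sbox():
    """AES S-box: multiplicative inverse followed by the affine transform."""
    sbox = []
    for x in range(256):
        inv = 0 if x == 0 else next(
            y for y in range(1, 256) if _gf_mul(x, y) == 1)
        s = inv
        for shift in range(1, 5):          # XOR with four left rotations
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


SBOX = _build_sbox()


def next_round_key(last, rcon):
    """Derive the next 16-byte AES-128 round key from the previous one."""
    words = [last[i:i + 4] for i in range(0, 16, 4)]
    t = words[3][1:] + words[3][:1]        # RotWord
    t = bytes(SBOX[b] for b in t)          # SubWord
    t = bytes([t[0] ^ rcon]) + t[1:]       # add the round constant
    out, prev = [], t
    for word in words:                     # each new word chains on the last
        prev = bytes(a ^ b for a, b in zip(word, prev))
        out.append(prev)
    return b"".join(out)


def round_keys(key):
    """Return [key, rk1, ..., rk10], recomputed from scratch on every call."""
    keys, rcon = [key], 1
    for _ in range(10):
        keys.append(next_round_key(keys[-1], rcon))
        rcon = _gf_mul(rcon, 2)            # 0x01, 0x02, ..., 0x80, 0x1b, 0x36
    return keys
```

Calling round_keys once per input block mirrors the recomputation described above: nothing is cached between calls, so in a register-only implementation the round keys would exist only for the duration of the atomic section.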


2.4 Kernel patch 


Many operating system issues have to be solved when 
implementing encryption solely on processor registers. 


³ A full source code listing including all steps can be found in
Appendix A.1.




As mentioned above, we have to patch the OS kernel for
two reasons: First, we have to run parts of AES atomically
in order to ensure that no intermediate state leaks
into memory during context switches. Second, only in
kernel space can we protect the debug registers from being
overwritten or read out by unprivileged user space
threads. We chose the most recent Linux kernel at that
time (version 2.6.36) to implement these changes.


Key protection 


For the security of TRESOR it is essential to protect the
key storage against malicious user access. Even if no local
attacker reads the debug registers on purpose,
the risk remains that a debugger is started accidentally
and pollutes the key storage. With a disk encryption system
being active in parallel, such a situation would immediately
lead to data corruption. Hence, the kernel must
be patched in a way that it denies any attempt to access
debug registers from user space.


  int ptrace_set_debugreg
        (tsk_struct *t, int n, long v)
  {
      thread_struct *thread = &(t->thread);
      int rc = 0;
      if (n == 4 || n == 5)
          return -EIO;
+ #ifdef CONFIG_CRYPTO_TRESOR
+     else if (n == 6 || n == 7)
+         return -EPERM;
+     else
+         return -EBUSY;
+ #endif
      if (n == 6) {
          thread->debugreg6 = v;
          goto ret_path;
      }
      if (n < HBP_NUM) {
          rc = ptrace_set_breakpoint_addr(t, n, v);
          if (rc) return rc;
      }
      if (n == 7) {
          rc = ptrace_write_dr7(t, v);
          if (!rc) thread->ptrace_dr7 = v;
      }
  ret_path: return rc;
  }


Figure 3: Patched setter for debug registers


The debug registers can only be accessed from privilege
level 0, i.e., from kernel space but not from user
space. Only the ptrace system call allows user space
applications like GDB to read from and write to them in
order to debug a traced child. This makes it effectively
possible to control access to debug registers centrally,
i.e., on system call level. Running an unpatched Linux
kernel, user space threads can access debug registers via
ptrace; running a TRESOR patched kernel, we filter
this access.

Figures 3 and 4 list the patches we applied to two functions of the ptrace
implementation in /arch/x86/kernel/ptrace.c: ptrace_set_debugreg and
ptrace_get_debugreg. The first patch returns -EBUSY whenever user space
attempts to write into breakpoint registers and -EPERM whenever it tries to
write into debug control registers. The second patch returns just 0 for any
read access to debug registers.





  long ptrace_get_debugreg(tsk_struct *t, int n)
  {
    thread_struct *thread = &(t->thread);
    unsigned long val = 0;
+ #ifndef CONFIG_CRYPTO_TRESOR
    if (n < HBP_NUM) {
      struct perf_event *bp;
      bp = thread->ptrace_bps[n];
      if (!bp) return 0;
      val = bp->hw.info.address;
    }
    else if (n == 6)
      val = thread->debugreg6;
    else if (n == 7)
      val = thread->ptrace_dr7;
+ #endif
    return val;
  }


Figure 4: Patched getter for debug registers 
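
What a caller sees from the two patched accessors can be summarized in a small user-space model. This is only a sketch of the return-value logic in Figures 3 and 4, not kernel code; the names model_set_debugreg and model_get_debugreg are hypothetical and the task argument of the real functions is left out.

```c
#include <errno.h>

/* User-space model of the TRESOR-patched setter (Figure 3).
 * n is the debug register index. The value v is ignored because
 * every write is rejected before it could reach a register. */
int model_set_debugreg(int n, long v)
{
    (void)v;
    if (n == 4 || n == 5)
        return -EIO;    /* already rejected by the unpatched code */
    else if (n == 6 || n == 7)
        return -EPERM;  /* dr6 and dr7 */
    else
        return -EBUSY;  /* breakpoint registers, which hold the key */
}

/* User-space model of the patched getter (Figure 4): with
 * CONFIG_CRYPTO_TRESOR set, the lookup is compiled out and the
 * initial value 0 is returned for every index. */
unsigned long model_get_debugreg(int n)
{
    (void)n;
    return 0;
}
```

From the point of view of a debugger, a write to a breakpoint register therefore looks like "all breakpoint registers are busy", while reads never reveal register contents.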


Additionally, we patched elementary functions in
/arch/x86/include/asm/processor.h to prevent kernel internals other than
ours from accessing the debug registers: native_set_debugreg and
native_get_debugreg. While the ptrace patches prevent user space threads
from accessing debug registers, these patches prevent the kernel itself
from accessing them, e.g., during context switching and ACPI suspend.


Atomicity 


The operating system regularly performs context 
switches where processor contents are written out to 
main memory. When TRESOR is active, the CPU con- 
text encompasses sensitive information because our im- 
plementation uses SSE and general purpose registers to 
store round keys and intermediate states. These registers 
are not holding sensitive data persistently, like the debug 
registers do, but they hold them temporarily for the pe- 
riod of encrypting one block. Thus, although our AES 
implementation runs solely on registers and although we 
have patched the kernel to protect debug registers, sensitive data may
still be written to RAM whenever the scheduler decides to preempt AES in
the middle of an encryption phase.

We solved this challenge by making the encryption of 
individual blocks atomic. Resetting the contents of SSE 
and general purpose registers before leaving the atomic 
section is an effective method to keep their contents away 
from context switching. Our atomicity does not only 
concern scheduling, but interrupt handling, too, because 
interrupt handlers, spontaneously called by the hardware, 
can write the CPU context into RAM as well. 

Hence, to set up an atomic section we have to disable 
interrupts. On multi-core systems it is sufficient to dis- 
able interrupts locally, i.e., on the CPU the encryption 
task actually takes place on. Other CPUs can proceed 
with their tasks as a context switch on one CPU does not 
affect registers of another. 


  preempt_disable();
  local_irq_save(*irq_flags);
  // ... (encrypt block)
  local_irq_restore(*irq_flags);
  preempt_enable();














Figure 5: AES block encryption runs atomically 


Figure 5 illustrates how to set up an atomic sec- 
tion in the Linux kernel that meets our needs. First 
preempt_disable is called to pause kernel pre- 
emption, meaning that running kernel code cannot 
be interrupted by scheduling anymore. Second, 
local_irq_save is called to save the local IRQ 
state and to disable interrupts locally. Next we are 
safe to encrypt an AES block as we are inside the 
atomic section. SSE and general purpose registers 
are only allowed to contain sensitive data within this 
section and must be reset before it is left. Once 
they are reset, local interrupts can be re-enabled (by
local_irq_restore) and kernel preemption can be
continued (by preempt_enable).
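
The same bracket pattern can be tried out in user space. The following is a minimal sketch by analogy only: it blocks POSIX signals with sigprocmask instead of disabling interrupts and preemption, a volatile buffer named scratch stands in for the SSE and general purpose registers, and the byte-wise XOR is a placeholder rather than AES. All names are hypothetical and none of this is TRESOR code.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <signal.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_BYTES 16

/* Stand-in for the registers that hold sensitive intermediate state
 * (hypothetical; real registers are not memory). */
static volatile uint8_t scratch[BLOCK_BYTES];

/* Process one block with signal delivery blocked and wipe the scratch
 * state before unblocking. Returns 0 on success, -1 on error. */
int process_block_atomically(const uint8_t in[BLOCK_BYTES],
                             uint8_t out[BLOCK_BYTES])
{
    sigset_t all, saved;
    size_t i;

    /* Analogue of preempt_disable() followed by local_irq_save(). */
    if (sigfillset(&all) != 0 ||
        sigprocmask(SIG_BLOCK, &all, &saved) != 0)
        return -1;

    for (i = 0; i < BLOCK_BYTES; i++)
        scratch[i] = in[i] ^ 0x5a;      /* placeholder transform, NOT AES */
    for (i = 0; i < BLOCK_BYTES; i++)
        out[i] = scratch[i];

    /* Reset sensitive state before the section is left. */
    for (i = 0; i < BLOCK_BYTES; i++)
        scratch[i] = 0;

    /* Analogue of local_irq_restore() followed by preempt_enable(). */
    return sigprocmask(SIG_SETMASK, &saved, NULL) == 0 ? 0 : -1;
}

/* Returns 1 if no sensitive state is left in scratch, 0 otherwise. */
int scratch_is_clear(void)
{
    size_t i;

    for (i = 0; i < BLOCK_BYTES; i++)
        if (scratch[i] != 0)
            return 0;
    return 1;
}
```

Blocking signals does not stop the scheduler from preempting a user process, so the analogy covers the structure of the section (enter, compute, wipe, leave), not the guarantee that the kernel primitives provide.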


Crypto API 


We integrated TRESOR into the Linux kernel Crypto- 
API, an interface for cryptographic ciphers, hash func- 
tions and compression algorithms. Besides a coherent 
design for cryptographic primitives, the Crypto-API pro- 
vides us with several advantages: 


• It allows ciphers to be dynamically (un)loaded as kernel modules. We left
  support for the standard AES module untouched and inserted TRESOR as a
  completely new cipher module. This enables end users to choose between
  TRESOR and standard AES, to run them in parallel, to compare their
  performance, etc.

• We do not have to implement cipher modes of operation, like ECB and CBC,
  ourselves since the Crypto-API handles them automatically. We have to
  provide the code to encrypt a single input block only and encrypting
  larger messages is done by the API.

• Existing software, most notably the disk encryption solution dm-crypt, is
  based on the Crypto-API and open to new cipher modules. That is, we do
  not have to patch dm-crypt to support TRESOR, but it is supported
  out-of-the-box. (Only third party encryption systems which do not rely on
  the Crypto-API, like TrueCrypt, cannot benefit from TRESOR without
  further ado.)


All in all, integrating TRESOR into the Crypto-API 
simplifies design. However, there is also a little draw- 
back of the Crypto-API: It comes with its own key man- 
agement which is too insecure for our security policy 
because it stores keys and key schedules inside RAM. 
To overcome this difficulty without changing the Crypto- 
API, we pass on a dummy key and look after the real key 
ourselves. Setting up an encryption system, the end user 
can pass on an arbitrary bit sequence as dummy key, but 
for apparent reasons it should not be equal to the real key. 
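
The dummy-key arrangement can be illustrated with a small user-space model. This sketch does not use the real Crypto-API; the type model_ctx, the functions model_setkey and model_encrypt_block, the array hidden_key and the XOR "cipher" are all hypothetical placeholders. It only shows that a key handed in through the generic interface has no influence on the output.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_BYTES 16

/* Stand-in for the real key, which TRESOR keeps in debug registers and
 * never hands to the generic key management (hypothetical value). */
static const uint8_t hidden_key[BLOCK_BYTES] = {
    0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
    0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f
};

/* Per-cipher context as a generic API might allocate it. */
struct model_ctx {
    size_t dummy_len;   /* only the length of the dummy key is kept */
};

/* Accept the key passed through the generic interface, but do not
 * store its bytes: it is a dummy. Returns 0 on success, -1 on error. */
int model_setkey(struct model_ctx *ctx, const uint8_t *dummy, size_t len)
{
    if (ctx == NULL || dummy == NULL)
        return -1;
    ctx->dummy_len = len;
    return 0;
}

/* "Encrypt" one block using only hidden_key (XOR placeholder, NOT AES). */
void model_encrypt_block(const struct model_ctx *ctx,
                         const uint8_t in[BLOCK_BYTES],
                         uint8_t out[BLOCK_BYTES])
{
    size_t i;

    (void)ctx;  /* the dummy key plays no role in the computation */
    for (i = 0; i < BLOCK_BYTES; i++)
        out[i] = in[i] ^ hidden_key[i];
}
```

Two contexts that were set up with different dummy keys produce identical output for the same input block, because the result depends on hidden_key alone.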


3 Compatibility 


We evaluated TRESOR regarding its compatibility with 
existing software (Section 3.1) and hardware (Sec- 
tion 3.2). 


3.1 Software compatibility 


Running on a 64-bit CPU, TRESOR is compatible with 
all three variants of AES, i.e., with AES-128, AES-192, 
and AES-256. To verify that no mistake slipped into 
the implementation we show its compatibility to standard 
AES: First of all we used official test vectors as defined in 
FIPS-197 [12]. TRESOR is integrated into the Crypto- 
API in such a way that a test manager proves its cor- 
rectness based on these vectors each time the TRESOR 
module is loaded. Second, we scrambled a partition with TRESOR, unscrambled
it with standard AES, and vice versa. Along with structured data like text
files and the filesystem itself, we created large random files and compared
both plaintext versions, i.e., before scrambling with AES and after
unscrambling with TRESOR, and found them to be equal. This indicates the
correctness of our implementation, not only in terms of single, predefined
blocks (as test vectors do) but also regarding a great amount of random
data.

Thanks to the Crypto-API, TRESOR is compatible with all kernel and user
space applications relying on this API. Among others these are the
kernel-based disk encryption solution dm-crypt and all its user space
frontends, e.g., cryptsetup and cryptmount. Figure 6 lists shell
instructions to set up a TRESOR encrypted partition on the device
/dev/sdb1. The password can be any arbitrary string as it is only used to
create the dummy key; it has no effect on the actual encryption process.
Consequently, a partition can be encrypted with the password “foobar” and
decrypted with the password “magic” (as long as the TRESOR key stays the
same).


  > cryptsetup create tr /dev/sdb1 -c tresor
  Enter passphrase: ******
  > mkfs.ext2 /dev/mapper/tr
  > mount /dev/mapper/tr /media/tresor/


Figure 6: Create TRESOR partition using cryptsetup

TRESOR is expected to be compatible with all Linux 
distributions, meaning that all prepackaged user mode 
binaries are expected to run on top of the TRESOR ker- 
nel. For “normal” user mode applications like a shell, 
the desktop environment, your web browser, etc., this is 
pretty much self-evident — for a debugger it is not. But 
even for debuggers like GDB, binary compatibility is not 
broken because access to debug registers is handled via 
ptrace and we intercept this system call to inform the 
user space that all breakpoint registers are busy — a sit- 
uation which could occur without TRESOR as well. To 
be more precise, we have to distinguish breakpoints and 
watchpoints: 


1. Breakpoints: Calling break, GDB does not use 
hardware breakpoints by default. Instead it uses 
software breakpoints because their performance 
penalty is negligible and they can be defined in 
any quantity. Hardware breakpoints must explic- 
itly be invoked by calling hbreak which fails on 
TRESOR with “Couldn’t write debug register: De- 
vice or resource busy.” 


2. Watchpoints: Unlike breakpoints, watchpoints can- 
not be implemented well in software and run about 
a hundred times slower than normal execution [33]. 
Thus, GDB sets hardware watchpoints by default. 
Calling watch fails on TRESOR with “Couldn’t 
write debug register: Device or resource busy.” 
as well. To use software watchpoints instead, 
set can-use-hw-watchpoints 0 must be 
run before. 


Admittedly, not being able to use hardware breakpoints may be a notable
drawback for malware analysts and software reverse engineers. Here we must
limit the target audience to end-users and “normal” developers.


3.2 Hardware compatibility


TRESOR is only compatible with real hardware. Run- 
ning TRESOR as guest inside a virtual machine is gen- 
erally insecure as the guest’s registers are stored in the 
host’s main memory. 

On hardware level, TRESOR’s compatibility is fur- 
ther restricted to the x86 architecture. It is possible to 
run AES entirely on the microprocessor, even without 
an AES-NI instruction set (given that your CPU sup- 
ports at least SSE2, which is the case for Pentium 4 and 
later CPUs). But in order to run full AES efficiently, 
processor compatibility is restricted to Intel’s Core-i se- 
ries at present. More clearly, we recommend the usage 
of 64-bit CPUs supporting the AES-NI instruction set. 
More and more processors will fall into this category in 
the future. Intel has supported AES-NI since its microarchitecture
code-named Westmere. AMD has announced support for AES-NI starting with its
Bulldozer core; processors based on this core are going to be released in
2011 [20].
All in all, many, if not most, upcoming x86 CPUs will 
support AES-NI. 


4 Performance 


We present performance measurements running a 64-bit 
Linux on an Intel Core i7-620M. The two performance 
aspects we evaluated are encryption speed and system 
reactivity. The latter may be affected because we halt the 
scheduler and run AES atomically. 


4.1 Encryption benchmarks 


We expected a performance penalty of TRESOR because 
of its recomputation of the key schedule for each input 
block — a substantial computing overhead compared to 
standard implementations which calculate the round keys 
only once. As shown in Section 2.3, round key gener- 
ation is a heavy operation compared to the rest of the 
AES algorithm; and we are running through the entire 
key schedule for each 128-bit chunk, even when encrypt- 
ing megabytes of data. 

To measure the throughput of TRESOR in practice, 
we performed several disk encryption benchmarks. For 
disk benchmarking we mounted four partitions, one en- 
crypted with TRESOR, one encrypted with generic AES, 
one encrypted with common AES-NI, and a plain one 
that was not encrypted at all. We mounted all of them 
with the sync option, meaning that I/O to the filesys- 
tem is done synchronously. We did this to avoid unre- 
alistically high speed measurements that arise from disk 
caching. Caching would falsify our results because encryption does not take
place before the data is actually going to disk.


Key (bits) | Generic AES | AES-NI | TRESOR | Plain
128        | 14.67       | 15.63  | 17.04  |
192        | 14.89       | 15.40  | 16.47  | 47.32
256        | 15.04       | 15.92  | 15.77  |














Table 1: dd throughput (in MB/s) 


Table 1 lists average values over 24 dd runs, each writing a 400M file.⁴
TRESOR-128 is faster than TRESOR-192, which in turn is faster than
TRESOR-256, because with an increasing key size, more rounds are performed
(10, 12 and 14, respectively) and thus more round keys must be calculated
on-the-fly.
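
The round counts quoted above follow the FIPS-197 rule Nr = Nk + 6, where Nk is the key length in 32-bit words. A minimal sketch of this arithmetic follows; the function names are illustrative, and the block count in the note below assumes that 400M in the dd runs means 400 MiB.

```c
#include <stdint.h>

/* Number of AES rounds for a key of key_bits bits (128, 192 or 256),
 * using Nr = Nk + 6 with Nk = key_bits / 32. Returns 0 for other sizes. */
unsigned aes_rounds(unsigned key_bits)
{
    if (key_bits != 128 && key_bits != 192 && key_bits != 256)
        return 0;
    return key_bits / 32 + 6;
}

/* Number of 128-bit round keys used per block: one for the initial
 * key addition plus one per round. Returns 0 for invalid sizes. */
unsigned aes_round_keys(unsigned key_bits)
{
    unsigned nr = aes_rounds(key_bits);

    return nr ? nr + 1 : 0;
}

/* Number of 16-byte AES blocks needed to cover n bytes (rounded up).
 * Under TRESOR each of them triggers a full key schedule run. */
uint64_t aes_blocks(uint64_t n)
{
    return (n + 15) / 16;
}
```

Under the 400 MiB assumption, and ignoring filesystem metadata, a single dd run covers 26,214,400 blocks, each of which repeats the key schedule.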

The table also shows that TRESOR performs well in 
comparison to conventional AES variants: TRESOR is 
faster than the generic implementation of AES and even 
slightly faster than common AES-NI implementations. 
We were surprised by this ourselves and double-checked 
the results — once with the AES-NI module shipped with 
Linux 2.6.36 and additionally with a self-written vari- 
ant. Currently, we have no good explanation for this ef- 
fect. One possibility is that TRESOR gains advantage 
over other threads due to its atomic sections. Another 
possibility is that linear key generation on registers per- 
forms generally better than fetching round keys one after 
another from RAM. 


      | Generic AES | AES-NI | TRESOR | Plain
read  | 2.10        | 2.54   | 2.80   | 7.95
write | 6.92        | 8.39   | 9.26   | 26.23


Table 2: Postmark benchmarks for AES-256 (in MB/s)


Besides measuring the throughput of dd, we utilized the disk drive
benchmarking utility Postmark [19]. Postmark creates, reads, changes, and
deletes many small files rather than just writing a single large file. As
shown in Table 2, TRESOR has an advantage over generic AES and common
AES-NI here as well.

Additionally, to measure the exact time needed to encrypt a single block,
we wrote a kernel module named tresor-test. Inserting this module, diverse
performance tests can be run for AES-128, AES-192, and AES-256. Findings
from this module confirm our assumption that TRESOR runs faster than
standard AES. For example, with TRESOR an AES-128 block is encrypted in
about 440 nanoseconds (ns), while standard AES needs about 538 ns (but
these values fluctuate heavily in practice, by more than 100%, and thus, we
consider disk benchmarks as more reliable).

Overall, TRESOR involves no performance penalty and the impact of an
on-the-fly round key generation is negligible.


4.2 Interactivity benchmarks


Performing heavy TRESOR operations in background, the OS reactivity to
interactive events may be affected because TRESOR disables interrupts in
order to run encryption atomically. In a desktop environment, for instance,
mouse and keyboard events are raising interrupts which are now delayed
until the end of a TRESOR operation. Furthermore, automatic scheduling is
disabled for this period.

Hence, to preserve the reactivity of the system, we set the scope of
atomicity to the smallest reasonable unit, namely to the encryption of a
single 128-bit input block. Between processing two 128-bit blocks,
interrupt processing and scheduling can take place as usual. Thereby
interactivity is hardly affected because, as mentioned in the last section,
it takes only 500 ns on average to encrypt a single block. Assuming it
never takes more than 1000 ns, interrupt handling and scheduling can take
place each microsecond if needed. But only delays greater than 150
milliseconds are perceptible by humans [10] and Linux scheduling slices are
commonly between 50 and 200 milliseconds as well.


⁴ The underlying series of tests can be found in Appendix A.2.



















Crypto      | Add. Load | AVG   | MAX  | STD
Generic AES | None      | 1.40  | 34.2 | 4.96
            | X         | 6.73  | 43.0 | 11.30
            | Compile   | 26.80 | 64.7 | 30.30
TRESOR      | None      | 0.93  | 37.3 | 3.99
            | X         | 6.65  | 44.3 | 11.30
            | Compile   | 26.40 | 79.2 | 30.30
Plain       | None      | 0.14  | 26.2 | (illegible)
            | X         | 0.40  | 23.8 | 2.34
            | Compile   | 0.74  | 32.4 | 3.47




















Table 3: Interbench (latencies in ms) 


To prove that interactivity is indeed not affected in practice, we draw
upon measurements from the benchmarking utility Interbench [9]. We used
Interbench to simulate a video player trying to get the CPU 60 times per
second, i.e., simulating 60 fps. Table 3 lists a selection of interactivity
benchmarks running on an Intel Core i7-620M under different loads.⁵ We
disabled all but the first CPU core to get a more convincing test set-up.
As shown by the table, latencies introduced by TRESOR do not differ much
from those introduced by generic AES: Average latencies are slightly better
for TRESOR, maximum latencies are slightly better for generic AES, and
standard deviations are almost the same.

⁵ Full benchmarks are listed in Appendix A.3.

Overall, the atomic sections introduced with TRESOR 
are too short to have any measurable effect to the reac- 
tivity of the Linux kernel. 
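
The timing argument of this section can be restated as simple arithmetic. The sketch below only divides the figures quoted in the text (an assumed worst case of 1000 ns per atomic section and a perception threshold of 150 ms); the function name is illustrative.

```c
#include <stdint.h>

/* How many atomic sections of at most section_ns nanoseconds fit
 * back-to-back into an interval of interval_ms milliseconds.
 * Returns 0 if section_ns is 0. */
uint64_t sections_per_interval(uint64_t interval_ms, uint64_t section_ns)
{
    if (section_ns == 0)
        return 0;
    return interval_ms * 1000000ULL / section_ns;
}
```

With the assumed 1000 ns bound, 150,000 atomic sections fit into the 150 ms threshold, so a single section can delay a pending interrupt by at most 1/150,000 of that threshold.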


5 Security 


Although it performs quite well, the ultimately decisive 
factor to employ TRESOR should not be its performance 
but its security qualities. Therefore we prove TRESOR’s 
resistance against attacks on debug registers and, above 
all, attacks on main memory. 


5.1 Memory attacks 


We have implemented AES in a way that nothing but 
the scrambled output block is actively written into main 
memory. However, this alone does not guarantee the se- 
curity of TRESOR, because sensitive data may be copied 
passively into RAM by side effects of the OS or hard- 
ware, such as interrupt handling, scheduling, swapping, 
ACPI suspend modes, etc. For example, we cannot di- 
rectly exclude the possibility that there is a piece of ker- 
nel code reading from debug registers in assembly rather 
than calling our patched native_get_debugreg in 
C. To minimize this risk, we performed extensive tests 
observing the main memory of a TRESOR system at run- 
time. 

The problem we faced was how to observe main 
memory reliably and efficiently. Just reading from 
/proc/kcore or /dev/mem in a running Linux sys- 
tem was not an option as the reading process itself in- 
vokes kernel code which may falsify the result. On the 
other hand, performing real attacks on main memory, like 
cold boot attacks, to read out what is physically left in 
RAM is very time consuming. Thus, we decided to run 
TRESOR as guest inside a virtual machine and to exam- 
ine its “physical” memory from the host. 

As VM we chose Qemu/KVM [11, 2] because it is 
lightweight, has a debug console and — last but not least 
— is compatible with TRESOR (many other VMs are not; 
VirtualBox [25], for instance, does not support AES-NI). 
The debug console of Qemu allows to read CPU regis- 
ters and to take physical memory dumps in a comfortable 
way. 

We started to browse the VM memory of an ac- 
tive disk encryption system with key recovery tools like 
AESKeyFind [15] and Interrogate [7]. As was expected, 
these tools successfully reconstructed the key of standard 
AES but not that of TRESOR. However, this alone is not 
a meaningful result because AES key recovery is commonly based on the AES
key schedule (since the secret key itself has no structure; it is just a
random bit sequence). As shown in Section 2.3, key schedules are not
persistently stored under TRESOR and thus, key recovery must fail; it would
fail even if the key actually leaked to RAM.

Unlike real attackers, we are aware of the secret key. 
We took advantage of this knowledge and searched for 
the key bit pattern. Overall, we could find a bit sequence 
matching the key of generic AES but none matching that 
of TRESOR. However, even these findings do not nec- 
essarily imply that the key is not present in RAM, be- 
cause it could be stored discontinuously. This is not even 
unlikely, because inside the CPU it is stored discontinu- 
ously as well (in four breakpoint registers, 64-bit each). 
Context switching may store each register separately, for 
example. 

Consequently, we had to perform more meaningful 
tests taking fractions of the key into account. And thus, 
we sought after the longest match of the key pattern and 
its reverse and any parts of those, in little and in big en- 
dian. We did not observe any case under TRESOR where 
the longest match exceeded three bytes. And matches 
of no more than three bytes can be explained purely by 
probabilities (as also attested by searching for random bit 
sequences instead of real key fractions). 
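
The fragment search described above amounts to a longest-common-substring scan between the known key and the memory image. The following brute-force sketch illustrates the idea at byte granularity. It is not the tool used for the measurements, the function names are hypothetical, and the treatment of byte order is reduced to one additional scan with the byte-reversed key.

```c
#include <stddef.h>
#include <stdint.h>

/* Length in bytes of the longest contiguous run that occurs both in
 * key[0..key_len) and in dump[0..dump_len). Brute force: every dump
 * offset is compared against every key offset. */
size_t longest_match(const uint8_t *dump, size_t dump_len,
                     const uint8_t *key, size_t key_len)
{
    size_t best = 0, d, k, n;

    for (d = 0; d < dump_len; d++) {
        for (k = 0; k < key_len; k++) {
            n = 0;
            while (d + n < dump_len && k + n < key_len &&
                   dump[d + n] == key[k + n])
                n++;
            if (n > best)
                best = n;
        }
    }
    return best;
}

/* Same scan against the byte-reversed key. Keys longer than 64 bytes
 * are not supported by this sketch and yield 0. */
size_t longest_match_reversed(const uint8_t *dump, size_t dump_len,
                              const uint8_t *key, size_t key_len)
{
    uint8_t rev[64];
    size_t i;

    if (key_len > sizeof rev)
        return 0;
    for (i = 0; i < key_len; i++)
        rev[i] = key[key_len - 1 - i];
    return longest_match(dump, dump_len, rev, key_len);
}
```

Running the same scan with random byte strings of the key's length in place of the key gives a baseline for the match length that chance alone produces in a given image.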

This further raised our confidence that neither the secret key nor any part
of it was in RAM at the time we took the memory dump. This leads us
directly to the next problem: How can we ensure that main memory
does not hold any part of the key at other times? In prin- 
ciple, this question is impossible to answer fully because 
of the intricacies of information leakage. In practice, it is 
hardly feasible to put the Linux kernel into all its possible 
states and to take a memory dump at the precise moment. 

We tried to analyze at least the states that are most relevant in our view,
namely those concerning swapping and suspend. Both are
of special interest for TRESOR as they swap CPU reg- 
isters into RAM or even further onto disk. We induced 
swapping by creating large data structures in RAM. Once 
Linux began to swap data onto disk we took a mem- 
ory dump and a disk dump and analyzed both with 
the methods mentioned before. Parts of the secret key 
could neither be traced on disk nor in RAM. (Also for 
generic AES, we never found sensitive information on 
disk because kernel space memory is not swappable un- 
der Linux.) 

To examine TRESOR’s behavior for ACPI S3 (sus- 
pend to RAM) we performed tests in Qemu and addi- 
tionally on real hardware because Qemu fails to wake up 
after S3. ACPI S4 (suspend to disk) on the other hand 
works just fine under Qemu. Our findings indicate that 
knowledge about the secret key is lost during both sus- 
pend modes, because again, neither in RAM nor on disk 
we could trace the key. As the CPU is switched off during suspension, the
key is irretrievably lost. We also verified this by looking into the CPU
registers before, during, and after suspension. After suspension, the CPU
context is restored completely except for the debug registers. (Therefore,
TRESOR prompts the user to re-enter the password upon wakeup.)


Active AES                        | Generic | TRESOR | TRESOR   | TRESOR  | None
Kernel state                      | normal  | normal | swapping | suspend | normal
Key recovery (AESKeyFind)         | yes     | no     | no       | no      | no
Dummy key matches                 | -       | yes    | yes*     | yes     | -
Real key matches                  | yes     | no     | no       | no      | no
Longest match of real key (bytes) | 32      | 3      | 3        | 3       | 3
*) found in RAM, not on disk


Table 4: AES-256 key tracing, an overview

Table 4 summarizes our findings of tracing an 
AES-256 key in RAM and disk storage. Only using 
Linux’ generic implementation of AES, the secret key 
can be recovered by AESKeyFind and Interrogate. Us- 
ing TRESOR, or no disk encryption system at all, no 
key can be recovered. Indeed, we can trace the dummy 
key of TRESOR as it is stored in RAM by the Crypto- 
API, but the dummy key is of absolutely no importance. 
AESKeyFind cannot even recover the dummy key be- 
cause no key schedule of it is ever computed or stored 
in RAM. The full 256-bit pattern of the real key can 
only be traced running generic AES. Under TRESOR, 
the longest sequence matching the real key has no more 
than three bytes in all kernel states we tested. 

Concluding, we searched for the secret AES key with 
different methods in different situations and neither the 
entire key, nor any parts of it, could ever be traced in 
RAM. While this proves that we are successfully keep- 
ing the key away from RAM in general, we have no per- 
suasive argument that the key never enters RAM. Admit- 
tedly, it is unlikely that a piece of code other than context 
switching swaps debug registers into RAM, but it can- 
not be ruled out. Anyhow, even for the hypothetical case 
where such a piece of code exists, we are quite confident 
that we can patch it. Hence, the feasibility of a system 
like TRESOR and its fundamental idea to store secret 
keys in debug registers is not at risk. 


5.2 Processor attacks 


Now that the key is stored inside the CPU and never en- 
ters RAM, attackers may target processor registers rather 
than RAM. Basically there are two ways to attack pro- 
cessor registers: on software and on hardware layer. 

On software layer we distinguish attackers who could gain root access and
attackers with unprivileged access. Naturally, for attackers with standard
user privileges there should be no way to read out the key. As debug
registers are only accessible from kernel space and as the only way for
standard users to execute kernel code is through system calls, unprivileged
attackers are successfully defeated by the ptrace patch. We verified this
with a user space utility making ptrace calls to read out debug registers;
as was expected, only 0 is returned to user space. Overwriting the key via
ptrace is not possible either; here -EBUSY (dr0 to dr3) and -EPERM (dr6 and
dr7) are returned.

















For root the situation is different, because for root 
there are more ways to execute kernel code: via mod- 
ules (LKMs) and via /dev/kmem. If Linux is compiled 
with LKM or KMEM support, root can insert arbitrary 
code into a running kernel and execute it with ring 0 priv- 
ileges. To demonstrate this for LKMs, we have created a 
small malicious module reading out the debug registers 
and writing them into the kernel log file. A similar attack 
is possible by writing to /dev/kmem. Thus, if com- 
piled with LKM or KMEM support, root can gain full 
access to the TRESOR key. On the other hand, if com- 
piled without LKM and KMEM support, even root has 
no ability to access the secret key — an advantage over 
conventional disk encryption systems where root can al- 
ways read and write the secret key from RAM. Running 
TRESOR without LKM and KMEM support, the key can 
be set once upon boot but never be retrieved or manipu- 
lated while the system is running. 

Besides the software layer, the hardware layer is crit- 
ical. With physical access to the machine, new possibil- 
ities open up for the attacker. First of all, for advanced 
electrical engineers it may be possible to read out regis- 
ters of a running CPU with an oscilloscope, by measur- 
ing the electromagnetic field around the CPU or what- 
ever else. But we are not aware of any successful attacks 
of this type. 

Instead, we focus on a simpler scenario: it may be 
possible to reboot the machine with a malicious boot de- 
vice reading out what is left in CPU registers (similar 
to cold boot attacks [14]). Performing such an attack, 
the interesting question is whether CPU registers are re- 
set to zero upon reboot or keep their contents until they 
are used otherwise. Besides the BIOS version and CPU
reinitialization code, the answer may depend on whether
the machine was rebooted by a software interrupt (e.g., 
by pressing CTRL-ALT-DEL) or by pressing a hardware 
reset button. While the former method keeps the CPU on 
power, the latter switches it off briefly. 

To investigate the practical impact of such an at- 
tack, we developed a malicious boot device called Cobra 
(Cold Boot Register Attack). First tested on virtual ma- 
chines, Cobra revealed that debug registers are reset on 
hardware reboots but not on software reboots. On soft- 
ware reboots, Cobra was able to restore debug regis- 
ters; all tested virtual machines (Qemu, Bochs, VMware 
and VirtualBox) showed this behavior. If real hardware 
showed this behavior as well, the consequences would 
be fatal. It would be easy to read out the secret key
and hence, TRESOR would be practically useless. For- 
tunately, it turned out that all VMs have a little imple- 
mentation flaw regarding this attack. On real hardware, 
debug registers are always reset to zero — also upon soft- 
ware reboots. We verified this by testing different ma- 
chines with different processors and BIOS versions. Ta- 
ble 5 gives an overview of our findings. 























           | BIOS    | Soft Reboot | Hard Reset
Athlon 64  | AMI     | -           | -
Pentium 4  | Phoenix | -           | -
Pentium M  | First   | -           | -
Celeron M  | Phoenix | -           | -
Core2 Duo  | First   | -           | -
Core i5    | Phoenix | -           | -
Core i7    | Lenovo  | -           | -
Qemu       | Bochs   | x           | -
Bochs      | Bochs   | x           | -
VMware     | Phoenix | x           | -
VirtualBox | N/A     | x           | -

[x] = vulnerable, [-] = not vulnerable














Table 5: Cobra (Cold Boot Register Attack) 


Overall, we argue that TRESOR is secure against 
local, unprivileged attacks in any case. Beyond that, 
TRESOR is even secure against attackers who could gain 
root access, if the kernel is compiled without LKM and 
KMEM support. On hardware level, TRESOR with- 
stands cold boot attacks against both main memory and 
CPU registers. This does only hold for real hardware, 
but running TRESOR inside a virtual machine is insecure 
anyhow as register contents of the guest are simulated in 
the host’s main memory. 


5.3 Side channel attacks 


Last, we want to mention briefly that TRESOR is resistant to timing attacks
[26]. This is not our own achievement but that of Intel, or, to be more
precise, that of
AES-NI. Intel states: “Beyond improving performance, 
the AES instructions provide important security bene- 
fits. By running in data-independent time and not us- 
ing tables, they help in eliminating the major timing and 
cache-based attacks that threaten table-based software 
implementations of AES.” [31] Based on this statement 
and the fact that there are no input dependent branches in 
the control flow of our code, we argue that TRESOR is 
resistant to side channel attacks, too. 


6 Conclusions and Future Work 


In the face of known attacks against main memory (above 
all, DMA and cold boot attacks) we consider RAM as too 
insecure to guarantee the confidentiality of secret disk 
encryption keys today. Thus we presented TRESOR, an 
approach to prevent main memory attacks against AES 
by implementing the encryption algorithm and its key 
management entirely on the microprocessor, solely using 
processor registers. We first explained important design 
choices of TRESOR and the key aspects of its imple- 
mentation. We then discussed how we integrated it into 
the Linux kernel. Eventually we showed that it performs 
well in comparison to the generic version of AES and, 
most importantly, that it satisfies our security policy. 


6.1 Conclusions 


Our primary security goal was to prevent tracing of the 
secret key in volatile memory, effectively making attacks 
on main memory pointless. Despite considerable effort, 
we were not able to retrieve the key in RAM. Therefore, 
we are confident that TRESOR is a substantial improve- 
ment compared to conventional disk encryption systems. 
As we took perfectly intact memory images of a run- 
ning TRESOR VM and knew the key beforehand, we 
had an advantage over real attackers trying to retrieve an 
unknown key. This strengthens our test results, because 
if we cannot retrieve the known key in an unscathed im- 
age, it is even more unlikely that an attacker can retrieve 
an unknown key in a partially damaged image. 

Another security goal was, of course, not to intro- 
duce flaws which are not present in ordinary encryp- 
tion systems. Therefore, we showed that TRESOR is 
safe against local attacks on the software layer as well 
as on the hardware layer. Interestingly, if the kernel is 
compiled without LKM and KMEM support, there is no 
(known) way to retrieve the secret key even though privi- 
leged root access is given — again, a substantial improve- 
ment compared to conventional disk encryption systems. 

Besides evaluating security aspects, we collected 
performance benchmarks, revealing that TRESOR is 
slightly faster than common versions of AES. Furthermore,
we showed that the reactivity of Linux is not affected
by the atomicity of encryption and decryption.

Summarizing, TRESOR runs encryption securely outside
RAM and thereby achieves higher security than any disk
encryption system we know, without losing performance
or compatibility with existing applications. To conclude,
it is possible to treat RAM as untrusted and to store
secret keys in a safe place on today’s standard x86
architecture.


6.2 Future Work


Currently, TRESOR can store only a single, static key,
because the debug registers cannot hold a second one.
Future versions of TRESOR may keep multiple disk
encryption keys securely inside RAM by scrambling them
with a master key, as in Loop-Amnesia [27].

This idea may be extended to an even broader use case
in the future: further AES keys for use with IPSec or
SSL, i.e., for use by userland applications, could be
encrypted with the TRESOR master key and stored securely
inside RAM. Session keys could be set and removed
dynamically in any quantity. To encrypt an input block
with such a session key, the user space application would
have to make a special system call that 1) invokes TRESOR
to read and decrypt the desired key and 2) lets TRESOR
use the freshly decrypted key to encrypt the input block.
Between these steps, the session key must not leave the
processor, meaning both steps need to happen inside the
same atomic section. As a downside, such a system would
require user space support and would incur a performance
penalty.

Another future task is to move the secret key into reg- 
isters which are even less frequently used than the debug 
registers, e.g., into machine specific registers (MSRs). 
As a benefit, by using MSRs as cryptographic key stor- 
age, debuggers would be able to use hardware break- 
points and watchpoints again. However, the best way 
to get round this problem would be the introduction of 
a special key register into future versions of AES-NI by 
Intel or AMD. 

Last, we want to investigate the possibility of implementing
a TRESOR-like system as a third-party application
for Windows.
Availability


TRESOR is free software published under the 
GNU GPL v2 [32]. Its source is available at 
www1.informatik.uni-erlangen.de/tresor.
A Appendix


A.1 AES-128 Source Code


.set rstate, %xmm0     // AES state
.set rhelp,  %xmm1     // helping reg
.set rk0,    %xmm2     // round key 0
.set rk1,    %xmm3     // round key 1
.set rk2,    %xmm4     // round key 2
.set rk3,    %xmm5     // round key 3
.set rk4,    %xmm6     // round key 4
.set rk5,    %xmm7     // round key 5
.set rk6,    %xmm8     // round key 6
.set rk7,    %xmm9     // round key 7
.set rk8,    %xmm10    // round key 8
.set rk9,    %xmm11    // round key 9
.set rk10,   %xmm12    // round key 10

.macro key_schedule r0 r1 rcon
    pxor            rhelp,rhelp
    movdqu          \r0,\r1
    shufps          $0x1f,\r1,rhelp
    pxor            rhelp,\r1
    shufps          $0x8c,\r1,rhelp
    pxor            rhelp,\r1
    aeskeygenassist $\rcon,\r0,rhelp
    shufps          $0xff,rhelp,rhelp
    pxor            rhelp,\r1
.endm

movq         %db0,%rax
movq         %rax,rk0
movq         %db1,%rax
movq         %rax,rhelp
shufps       $0x44,rhelp,rk0
pxor         rk0,rstate

key_schedule rk0 rk1  0x1
key_schedule rk1 rk2  0x2
key_schedule rk2 rk3  0x4
key_schedule rk3 rk4  0x8
key_schedule rk4 rk5  0x10
key_schedule rk5 rk6  0x20
key_schedule rk6 rk7  0x40
key_schedule rk7 rk8  0x80
key_schedule rk8 rk9  0x1b
key_schedule rk9 rk10 0x36

aesenc       rk1,rstate
aesenc       rk2,rstate
aesenc       rk3,rstate
aesenc       rk4,rstate
aesenc       rk5,rstate
aesenc       rk6,rstate
aesenc       rk7,rstate
aesenc       rk8,rstate
aesenc       rk9,rstate
aesenclast   rk10,rstate


A.2 dd Benchmarks for AES-192 

















--- Plain

410 MB copied, 6.30053 s
410 MB copied, 6.93762 s
410 MB copied, 10.0737 s
410 MB copied, 9.66396 s
410 MB copied, 8.20149 s
410 MB copied, 7.42723 s
410 MB copied, 7.16408 s
410 MB copied, 8.54818 s
410 MB copied, 9.91214 s
410 MB copied, 6.91875 s
410 MB copied, 10.3003 s
410 MB copied, 8.63959 s
410 MB copied, 10.3342 s
410 MB copied, 8.75659 s
410 MB copied, 8.12789 s
410 MB copied, 8.96658 s
410 MB copied, 7.90555 s
410 MB copied, 11.7209 s
410 MB copied, 8.31128 s
410 MB copied, 11.8716 s
410 MB copied, 9.90721 s
410 MB copied, 8.57025 s
410 MB copied, 9.34468 s
410 MB copied, 9.14162 s

--- TRESOR

410 MB copied, 23.9045 s
410 MB copied, 24.1203 s
410 MB copied, 26.3410 s
410 MB copied, 22.1279 s
410 MB copied, 24.9356 s
410 MB copied, 25.0071 s
410 MB copied, 23.5777 s
410 MB copied, 27.8006 s
410 MB copied, 24.8987 s
410 MB copied, 25.8959 s
410 MB copied, 25.7694 s
410 MB copied, 26.5178 s
410 MB copied, 25.3663 s
410 MB copied, 25.0566 s
410 MB copied, 25.4963 s
410 MB copied, 24.3083 s
410 MB copied, 23.9965 s
410 MB copied, 25.2287 s

--- Generic AES
A.3 Interbench for AES-256








--- Plain

Load      Latency +/- SD (ms)   Max Latency   % Desired CPU   % Deadlines Met
None      0.143 +/- 1.51        26.2          100             99 )03.
X         0.399 +/- 2.34        23.8          100             98.6
Burn      0.23 +/- 1.93         39            99.9            99.2
Write     0.16 +/- 0.852        20.7          100             99.9
Read      0.118 +/- 0.772       20.2          100             99.9
Compile   0.738 +/- 3.47        32.4          100             97
Memload   0.009 +/- 0.027       0.498         100             100

--- TRESOR

Load      Latency +/- SD (ms)   Max Latency   % Desired CPU   % Deadlines Met
None      0.926 +/- 3.99        3123          9:9: 29         94.9
X         6.65 +/- 11.3         44.3          99.6            64.8
Burn      28 +/- 31             66.6          48.6            2.45
Write     1.9: +/- 5.87         33.7          99.8            89.9
Read      2.41 +/- 6.32         27.8          100             86.2
Compile   26.4 +/- 30.3         79.2          50.1            4.73
Memload   9.24 +/- 13.2         48.7          98.3            48.7

--- Generic AES

Load      Latency +/- SD (ms)   Max Latency   % Desired CPU   % Deadlines Met
None      1.4 +/- 4.96          34.2          99:59           92.3
X         6213 +/- T1.3         43            99.2            63.5
Burn      29.2 +/- 32.4         66.5          45.8            Bie AE
Write     0.657 +/- 233.39      36.5          99.8            96.9
Read      3.05 +/- 7.18         3.6.39        99.9            82.5
Compile   26.8 +/- 30.3         64.7          48.9            4.09
Memload   9.34 +/- 13.4         45.3          97.7            48.5
Bubble Trouble: Off-Line De-Anonymization of Bubble Forms


Joseph A. Calandrino, William Clarkson and Edward W. Felten 
Abstract


Fill-in-the-bubble forms are widely used for surveys, 
election ballots, and standardized tests. In these and 
other scenarios, use of the forms comes with an implicit 
assumption that individuals’ bubble markings them- 
selves are not identifying. This work challenges this 
assumption, demonstrating that fill-in-the-bubble forms 
could convey a respondent’s identity even in the absence 
of explicit identifying information. We develop methods 
to capture the unique features of a marked bubble and 
use machine learning to isolate characteristics indicative 
of its creator. Using surveys from more than ninety indi- 
viduals, we apply these techniques and successfully re- 
identify individuals from markings alone with over 50% 
accuracy. This bubble-based analysis can have either 
positive or negative implications depending on the ap- 
plication. Potential applications range from detection of 
cheating on standardized tests to attacks on the secrecy 
of election ballots. To protect against negative conse- 
quences, we discuss mitigation techniques to remove a 
bubble’s identifying characteristics. We suggest addi- 
tional tests using longitudinal data and larger datasets 
to further explore the potential of our approach in real- 
world applications. 


1 Introduction 


Scantron-style fill-in-the-bubble forms are a popular 
means of obtaining human responses to multiple-choice 
questions. Whether conducting surveys, academic tests, 
or elections, these forms allow straightforward user com- 
pletion and fast, accurate machine input. Although not 
every use of bubble forms demands anonymity, common 
perception suggests that bubble completion does not re- 
sult in distinctive marks. We demonstrate that this as- 
sumption is false under certain scenarios, enabling use 
of these markings as a biometric. The ability to uncover 
identifying bubble marking patterns has far-reaching
potential implications, from detecting cheating on
standardized tests to threatening the anonymity of
election ballots.

Bubble forms are widely used in scenarios where con- 
firming or protecting the identity of respondents is crit- 
ical. Over 137 million registered voters in the United 
States reside in precincts with optical scan voting ma- 
chines [27], which traditionally use fill-in-the-bubble pa- 
per ballots. Voter privacy (and certain forms of fraud) 
relies on an inability to connect voters with these bal- 
lots. Surveys for research and other purposes use bub- 
ble forms to automate data collection. The anonymity 
of survey subjects not only affects subject honesty but 
also impacts requirements governing human subjects re- 
search [26]. Over 1.6 million members of the high school 
class of 2010 completed the SAT [8], one of many large- 
scale standardized tests using bubble sheets. Educators, 
testing services, and other stakeholders have incentives 
to detect cheating on these tests. The implications of 
our findings extend to any use of bubble forms for which 
the ability to “fingerprint” respondents may have conse- 
quences, positive or negative. 


Our contributions. We develop techniques to extract 
distinctive patterns from markings on completed bubble 
forms. These patterns serve as a biometric for the form 
respondent. To account for the limited characteristics 
available from markings, we apply a novel combination 
of image processing and machine learning techniques 
to extract features and determine which are distinctive 
(see Section 2). These features can enable discovery of 
respondents’ identities or of connections between com- 
pleted bubbles. 

To evaluate our results on real-world data, we use a
corpus of over ninety answer sheets from an unrelated
survey of high school students (see Section 3). We train
on a subset of completed bubbles from each form,
effectively extracting a biometric for the corresponding
respondent. After training, we obtain a test set of
additional bubbles from each form and classify each test
set. For certain parameters, our algorithms’ top match is
correct over 50% of the time, and the correct value falls
in the top 3 matches 75% of the time. In addition, we test
our ability to detect when someone other than the expected
respondent completes a form, simultaneously achieving
false positive and false negative rates below 10%. We
conduct limited additional tests to confirm our results
and explore details available from bubble markings.


Figure 1: Example marked bubbles. Panels: (a) Person 1,
(b) Person 2, (c) Person 3, (d) Person 4, (e) Person 4 - Gray.
The background color is white in all examples except
Figure 1(e), which is gray.

Depending on the application, these techniques can 
have positive or negative repercussions (see Section 4). 
Analysis of answer sheets for standardized tests could 
provide evidence of cheating by test-takers, proctors, or 
other parties. Similarly, scrutiny of optical-scan bal- 
lots could uncover evidence of ballot-box stuffing and 
other forms of election fraud. With further improvements 
in accuracy, the methods developed could even enable 
new forms of authentication. Unfortunately, the tech- 
niques could also undermine the secret ballot and anony- 
mous surveys. For example, some jurisdictions publish 
scanned images of ballots following elections, and em- 
ployers could match these ballots against bubble-form 
employment applications. Bubble markings serve as a 
biometric even on forms and surveys otherwise contain- 
ing no identifying information. We discuss methods for 
minimizing the negative impact of this work while ex- 
ploiting its positive uses (see Section 5). 

Because our test data is somewhat limited, we dis- 
cuss the value of future additional tests (see Section 7). 
For example, longitudinal data would allow us to better 
understand the stability of an individual’s distinguishing 
features over time, and stability is critical for most uses 
discussed in the previous paragraph. 


2 Learning Distinctive Features


Filling in a bubble is a narrow, straightforward task.
Consequently, the space for inadvertent variation is
relatively constrained. The major characteristics of a
filled-in bubble are consistent across the image
population (most are relatively circular and dark in
similar locations with slight imperfections), resulting in
a largely homogeneous set. See Figure 1. This creates a
challenge in capturing the unique qualities of each bubble
and extrapolating a respondent’s identity from them.

We assume that all respondents start from the same 
original state—an empty bubble with a number inscribed 
corresponding to the answer choice (e.g., choices 1-5 in 
Figure 1). When respondents fill in a bubble, opportuni- 
ties for variation include the pressure applied to the draw- 
ing instrument, the drawing motions employed, and the 
care demonstrated in uniformly darkening the entire bub- 
ble. In this work, we consider applications for which it 
would be infeasible to monitor the exact position, pressure,
and velocity of pencil motions throughout the coloring
process.1 In other contexts, such as signature verification,
these details can be useful. This information
would only strengthen our results and would be helpful 
to consider if performing bubble-based authentication, as 
discussed in Section 4. 


2.1 Generating a Bubble Feature Vector 


Image recognition techniques often use feature vectors 
to concisely represent the important characteristics of an 
image. As applied to bubbles, a feature vector should 
capture the unique ways that a mark differs from a per- 
fectly completed bubble, focusing on characteristics that 
tend to distinguish respondents. Because completed bub- 
bles tend to be relatively homogeneous in shape, many 
common metrics do not work well here. To measure the 
unique qualities, we generate a feature vector that blends 
several approaches from the image recognition literature. 
Specifically, we use PCA, shape descriptors, and a cus- 
tom bubble color distribution to generate a feature vector 
for each image. 


1Clarkson et al. [7] use multiple scans to infer the 3D surface texture 
of paper, which may suggest details like pressure. We assume multiple 
scans to be infeasible for our applications. 
Figure 2: An example bubble marking with an approximating
circle. The circle minimizes the sum of the squared
deviation from the radius. We calculate the circle’s
center and mean radius, the marking’s variance from the
radius, and the marking’s center of mass.


Principal Component Analysis (PCA) is one common
technique for generating a feature set to represent an
image [16]. At a high level, PCA reduces the dimensionality
of an image, generating a concise set of features that
are statistically independent from one another. PCA begins
with a sample set of representative images to generate
a set of eigenvectors. In most of our experiments, the
representative set comprised 368 images and contained at
least one image for each (respondent, answer choice) pair.
Each representative image is normalized and treated as a
column in a matrix. PCA extracts a set of eigenvectors
from this matrix, forming a basis. We retain the 100
eigenvectors with the highest weight. These eigenvectors
account for approximately 90% of the information
contained in the representative images.

To generate the PCA segment of our feature vector, 
a normalized input image (treated as a column vector) 
is projected onto the basis defined by the 100 strongest 
eigenvectors. The feature vector is the image’s coordi- 
nates in this vector space—i.e., the weights on the eigen- 
vectors. Because PCA is such a general technique, it may 
fail to capture certain context-specific geometric charac- 
teristics when working exclusively with marked bubbles. 
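
As a rough illustration of the projection step, the following Python sketch computes a small eigenvector basis and projects a flattened image onto it. The function names (pca_basis, project), the mean-centering step, and the use of power iteration with deflation are assumptions made for this sketch; the text does not specify how the eigenvectors were computed or how images were normalized.

```python
import math
import random


def pca_basis(samples, k, iters=500, seed=0):
    """Return (mean, basis): the k strongest eigenvectors of the sample
    covariance, found by power iteration with deflation (an assumed solver)."""
    n, d = len(samples), len(samples[0])
    mean = [sum(s[j] for s in samples) / n for j in range(d)]
    centered = [[s[j] - mean[j] for j in range(d)] for s in samples]
    # d x d covariance matrix; fine for a toy example, costly for real images.
    cov = [[sum(c[a] * c[b] for c in centered) / n for b in range(d)]
           for a in range(d)]
    rng = random.Random(seed)
    basis = []
    for _ in range(k):
        v = [rng.random() + 0.1 for _ in range(d)]
        for _ in range(iters):
            w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            # Deflation: remove components along eigenvectors already found.
            for u in basis:
                dot = sum(w[j] * u[j] for j in range(d))
                w = [w[j] - dot * u[j] for j in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                v = [0.0] * d   # no variance left in any remaining direction
                break
            v = [x / norm for x in w]
        basis.append(v)
    return mean, basis


def project(image, mean, basis):
    """Coordinates of a flattened image in the eigenvector basis."""
    centered = [image[j] - mean[j] for j in range(len(image))]
    return [sum(c * u for c, u in zip(centered, vec)) for vec in basis]
```

With k set to 100 and each sample holding the pixel values of one representative image, the list returned by project would correspond to the PCA segment of the feature vector.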


To compensate for the limitations of PCA, we capture
shape details of each bubble using a set of geometric
descriptors and capture color variations using a custom
metric. Peura et al. [24] describe a diverse set of
geometric descriptors that measure statistics about various
shapes. This set includes a shape’s center of mass, the
center and radius of a circle approximating its shape, and
variance of the shape from the approximating circle’s
radius (see Figure 2). The approximating circle minimizes
the sum of squared radius deviations. We apply the
specified descriptors to capture properties of a marked
bubble’s boundary. Instead of generating these descriptors
for the full marked bubble alone, we also generate the
center of mass, mean radius, and radial variance for
“sectors” of the marked bubble. To form these sectors, we
first evenly divide each dot into twenty-four 15° “slices.”
Sectors are the 24 overlapping pairs of adjacent slices
(see Figure 3). Together, these geometric descriptors add
368 features.


Figure 3: Each dot is split into twenty-four 15° slices.
Adjacent slices are combined to form a sector, spanning
30°. The first few sectors are depicted here.


Figure 4: Feature vector components and their contributions
to the final feature vector length: PCA (100 features),
sector shape (368 features), and color distribution
(336 features), for a total of 804 features.
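
A minimal Python sketch of the per-shape descriptors named above. It takes boundary points as (x, y) pairs and, as a simplifying assumption, centers the approximating circle at the center of mass instead of fitting the center separately; for a fixed center, the mean distance is the radius that minimizes the sum of squared radius deviations. The function name and input representation are hypothetical.

```python
import math


def shape_descriptors(points):
    """Center of mass, mean radius, and radial variance of boundary points.

    Assumption: the circle is centered at the center of mass; the original
    descriptors may fit the circle's center separately.
    """
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    radii = [math.hypot(x - cx, y - cy) for x, y in points]
    mean_radius = sum(radii) / n
    radial_variance = sum((r - mean_radius) ** 2 for r in radii) / n
    return (cx, cy), mean_radius, radial_variance
```

The same function could be applied to the boundary points of the whole bubble and to the points falling in each sector.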


Finally, we developed and used a simple custom metric
to represent color details. We divide a dot into sectors as 
in the previous paragraph. For each sector, we create a 
histogram of the grayscale values for the sector consist- 
ing of fifteen buckets. We throw away the darkest bucket, 
as these pixels often represent the black ink of the circle 
border and answer choice numbering. Color distribution 
therefore adds an additional 14 features for each sector, 
or a total of 336 additional features. 
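
The sector histogram can be sketched in Python as follows. The pixel representation (x, y, gray with gray in 0..255), the equal-width buckets, the use of raw counts, and the function names are assumptions for illustration; the text only fixes the number of slices, sectors, and buckets.

```python
import math

SLICES = 24    # 15-degree slices
BUCKETS = 15   # grayscale buckets per sector; the darkest one is discarded


def slice_index(x, y, cx, cy):
    """Index (0..23) of the 15-degree slice containing pixel (x, y)."""
    angle = math.degrees(math.atan2(y - cy, x - cx)) % 360.0
    return int(angle // (360.0 / SLICES)) % SLICES


def color_features(pixels, cx, cy):
    """Return 24 * 14 = 336 histogram counts for (x, y, gray) pixels.

    Sector i is the pair of adjacent slices (i, i + 1 mod 24), so
    neighboring sectors overlap by one slice.
    """
    per_slice = [[0] * BUCKETS for _ in range(SLICES)]
    for x, y, gray in pixels:
        bucket = min(gray * BUCKETS // 256, BUCKETS - 1)
        per_slice[slice_index(x, y, cx, cy)][bucket] += 1
    features = []
    for i in range(SLICES):
        nxt = (i + 1) % SLICES
        hist = [a + b for a, b in zip(per_slice[i], per_slice[nxt])]
        features.extend(hist[1:])   # drop bucket 0, the darkest values
    return features
```

Because each slice belongs to two sectors, every retained pixel contributes to two of the 24 histograms.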


The resulting feature vector consists of 804 features 
that describe shape and color details for a dot and each 
of its constituent sectors (see Figure 4). See Section 3.3, 
where we evaluate the benefits of this combination of 
features. Given feature vectors, we can apply machine 
learning techniques to infer distinguishing details and 
differentiate between individuals. 
2.2 Identifying Distinguishing Features


Once a set of feature vectors is generated for the relevant
dots, we use machine learning to identify and utilize
the important features. Our analysis tools make heavy
use of Weka, a popular Java-based machine learning 
workbench that provides a variety of pre-implemented 
learning methods [12]. In all experiments, we used Weka 
version 3.6.3. 

We apply Weka’s implementation of the Sequential 
Minimal Optimization (SMO) supervised learning al- 
gorithm to infer distinctive features of respondents and 
classify images. SMO is an efficient method for train- 
ing support vector machines [25]. Weka can accept a 
training dataset as input, use the training set and learn- 
ing algorithm to create a model, and evaluate the model 
on a test set. In classifying individual data points, Weka 
internally generates a distribution over possible classes, 
choosing the class with the highest weight. For us, this 
distribution is useful in ranking the respondents believed 
to be responsible for a dot. We built glue code to collect 
and process both internal and exposed Weka data effi- 
ciently. 


3 Evaluation 


To evaluate our methods, we obtained a corpus of 154 
surveys distributed to high school students for research 
unrelated to our study. Although each survey is ten 
pages, the first page contained direct identifying infor- 
mation and was removed prior to our access. Each of the 
nine available pages contains approximately ten ques- 
tions, and each question has five possible answers, se- 
lected by completing round bubbles numbered 1-5 (as 
shown in Figure 1). 

From the corpus of surveys, we removed any com- 
pleted in pen to avoid training on writing utensil or pen 
color.2 Because answer choices are numbered, some risk 
exists of training on answer choice rather than marking 
patterns—e.g., respondent X tends to select bubbles with 
“4” in the background. For readability, survey questions 
alternate between a white background and a gray back- 
ground. To avoid training bias, we included only surveys 
containing at least five choices for each answer 1-4 on a
white background (except where stated otherwise),
leaving us with 92 surveys.

For the 92 surveys meeting our criteria, we scanned
the documents using an Epson v700 Scanner at 1200
DPI. We developed tools to automatically identify,
extract, and label marked bubbles by question answered
and choice selected. After running these tools on the
scanned images, we manually inspected the resulting
images to ensure accurate extraction and labeling.


2We note that respondents failing to use pencil or to complete the
survey anecdotally tended not to be cautious about filling in the bubbles
completely. Therefore, these respondents may be more distinguishable
than those whose surveys were included in our experiment.

Due to the criteria that we imposed on the surveys, 
each survey considered has at least twenty marked bub- 
bles on a white background, with five bubbles for the “1” 
answer, five for the “2” answer, five for the “3” answer, 
and five for the “4” answer.3 For each experiment, we
selected our training and test sets randomly from this set 
of twenty bubbles, ensuring that sets have equal numbers 
of “1,” “2,” “3,” and “4” bubbles for each respondent and 
trying to balance the number of bubbles for each answer 
choice when possible. 

In all experiments, a random subset of the training set 
was selected and used to generate eigenvectors for PCA. 
We required that this subset contain at least one exam- 
ple from each respondent for each of the four relevant 
answer choices but placed no additional constraints on 
selection. For each dot in the training and test sets, we 
generated a feature vector using PCA, geometric descrip- 
tors, and color distribution, as described in Section 2.1. 

We conducted two primary experiments and a number 
of complementary experiments. The first major test ex- 
plores our ability to re-identify a respondent from a test 
set of eight marks given a training set of twelve marks per 
respondent. The second evaluates our ability to detect 
when someone other than the official respondent com- 
pletes a bubble form. To investigate the potential of 
bubble markings and confirm our results, we conducted 
seven additional experiments. We repeated each experi- 
ment ten times and report the average of these runs. 

Recall from Section 2.2 that we can rank the respon- 
dents based on how strongly we believe each one to be 
responsible for a dot. For example, the respondent that 
created a dot could be the first choice or fiftieth choice 
of our algorithms. A number of our graphs effectively 
plot a cumulative distribution showing the percent of test 
cases for which the true corresponding respondent falls 
at or above a certain rank—e.g., for 75% of respondents 
in the test set, the respondent’s true identity is in the top 
three guesses. 
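
The cumulative distribution described here can be computed as in the following Python sketch. The function name and the input format (one 1-based rank of the true respondent per test case) are assumptions for illustration.

```python
def rank_cdf(true_ranks, num_respondents):
    """Percent of test cases whose true respondent is at rank <= r, for each r.

    true_ranks holds the 1-based rank of the correct respondent per test case.
    """
    total = len(true_ranks)
    counts = [0] * (num_respondents + 1)
    for rank in true_ranks:
        counts[rank] += 1
    cdf, running = [], 0
    for r in range(1, num_respondents + 1):
        running += counts[r]
        cdf.append(100.0 * running / total)
    return cdf   # cdf[r - 1] is the percentage for rank r
```

For instance, a value of 75.0 at index 2 would mean that the true respondent is among the top three guesses for 75% of test cases.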


3.1 Respondent Re-Identification 


This experiment measured the ability to re-identify
individuals from their bubble marking patterns. For this
test, we trained our model using twelve sample bubbles
per respondent, including three bubbles for each answer
choice 1-4. Our test set for each respondent contained
the remaining two bubbles for each answer choice, for a
total of eight test bubbles. We applied the trained model
to each of the 92 respondents’ test sets and determined
whether the predicted identity was correct.


3To keep a relatively large number of surveys, we did not consider
the number of “5” answers and do not use these answers in our analysis.


Figure 5: Respondent re-identification with 12 training
bubbles and 8 test bubbles per respondent.
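
The training and test split used in this experiment (three training and two test bubbles per answer choice, drawn from five bubbles per choice) can be sketched in Python as follows. The function name, the dictionary input, and the seed parameter are hypothetical; only the set sizes come from the text.

```python
import random


def split_bubbles(bubbles_by_answer, train_per_answer=3, test_per_answer=2,
                  seed=None):
    """Randomly split one respondent's bubbles into training and test sets.

    bubbles_by_answer maps an answer choice (1-4) to that respondent's
    bubbles for the choice; each choice contributes equally to both sets.
    """
    rng = random.Random(seed)
    train, test = [], []
    for answer in sorted(bubbles_by_answer):
        chosen = rng.sample(bubbles_by_answer[answer],
                            train_per_answer + test_per_answer)
        train.extend(chosen[:train_per_answer])
        test.extend(chosen[train_per_answer:])
    return train, test
```

With four answer choices, this yields twelve training bubbles and eight test bubbles per respondent.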

To use multiple marks per respondent in the test set, 
we classify the marks individually, yielding a distribu- 
tion over the respondents for each mark in the set. After 
obtaining the distribution for each test bubble in a group, 
we combine this data by averaging the values for each 
respondent. Our algorithms then order the respondents 
from highest to lowest average confidence, with highest 
confidence corresponding to the top choice. 
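
A minimal Python sketch of this combination step. The input format (one dictionary per test mark, mapping a respondent identifier to the classifier's weight) and the function name are assumptions; the weights stand in for the per-class distribution produced by the classifier.

```python
def rank_respondents(distributions):
    """Average per-mark class distributions and rank respondents.

    distributions is a list with one dict per test mark, mapping a
    respondent identifier to the classifier's weight for that respondent.
    Returns respondent identifiers ordered from most to least likely.
    """
    totals = {}
    for dist in distributions:
        for respondent, weight in dist.items():
            totals[respondent] = totals.get(respondent, 0.0) + weight
    n = len(distributions)
    averages = {r: t / n for r, t in totals.items()}
    return sorted(averages, key=lambda r: averages[r], reverse=True)
```

The first element of the returned list is the top guess; the position of the true respondent in the list is that respondent's rank.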

On average, our algorithm’s first guess identified the 
correct respondent with 51.1% accuracy. The correct re- 
spondent fell in the top three guesses 75.0% of the time 
and in the top ten guesses 92.4% of the time. See Fig- 
ure 5, which shows the percentage of test bubbles for 
which the correct respondent fell at or above each pos- 
sible rank. This initial result suggests that individuals 
complete bubbles in a highly distinguishing manner, al- 
lowing re-identification with surprisingly high accuracy. 


3.2 Detecting Unauthorized Respondents 


One possible application of this technique is to detect
when someone other than the authorized respondent
creates a set of bubbles. For example, another person
might take a test or survey in place of an authorized
respondent. We examined our ability to detect these cases
by measuring how often our algorithm would correctly
detect a fraudulent respondent who has claimed to be
another respondent. We trained our model using twelve
training samples from each respondent and examined the
output of our model when presented with eight test
bubbles. The distribution of these sets is the same as in
Section 3.1. For these tests, we set a threshold for the
lowest rank accepted as the respondent. For example,
suppose that the threshold is 12. To determine whether a
given set of test bubbles would be accepted for a given
respondent, we apply our trained model to the test set.
If the respondent’s identity appears in any of the top 12
(of 92) positions in the ranked list of respondents, that
test set would be accepted for the respondent. For each
respondent, we apply the trained model both to the
respondent’s own test bubbles and to the 91 other
respondents’ test bubbles.


Figure 6: False positive and false negative rates when
detecting unauthorized respondents.

We used two metrics to assess the performance of our 
algorithms in this scenario. The first, false positive rate, 
measures the probability that a given respondent would 
be rejected (labeled a cheater) for bubbles that the re- 
spondent actually completed. The second metric, false 
negative rate, measures the probability that bubbles com- 
pleted by any of the 91 other respondents would be ac- 
cepted as the true respondent’s. We varied the threshold 
from 1 to 92 for our tests. We expected the relationship
between threshold and false negative rate to be roughly
linear: increasing the threshold by 1 increases, by
roughly 1/92, the probability that a respondent randomly
falls above the threshold for another respondent’s test set.4
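
The two metrics can be computed for a given threshold as in the following Python sketch. The rank_of dictionary and the function name are hypothetical; the sketch assumes the rank of every claimed respondent on every author's test set has already been obtained from the model.

```python
def error_rates(rank_of, respondents, threshold):
    """False positive and false negative rates for one rank threshold.

    rank_of[(claimed, author)] is the 1-based rank the model assigns to
    respondent `claimed` on the test set actually completed by `author`.
    A test set is accepted for `claimed` when that rank is <= threshold.
    """
    rejected_genuine = sum(
        1 for r in respondents if rank_of[(r, r)] > threshold)
    accepted_impostor = sum(
        1 for claimed in respondents for author in respondents
        if claimed != author and rank_of[(claimed, author)] <= threshold)
    n = len(respondents)
    false_positive = rejected_genuine / n
    false_negative = accepted_impostor / (n * (n - 1))
    return false_positive, false_negative
```

Evaluating this function for every threshold from 1 to the number of respondents gives the two curves whose crossing point balances the error rates.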

Our results are presented in Figure 6. As we increase 
the threshold, the false positive rate drops precipitously 
while the false negative rate increases roughly linearly. 
If we increase the threshold to 8, then a fraudulent re- 
spondent has a 7.8% chance of avoiding detection (by 
being classified as the true respondent), while the true 
respondent has a 9.9% chance of being mislabeled a 
cheater. These error rates intersect with a threshold ap- 
proximately equal to 9, where the false positive and false 
negative rates are 8.8%. 


4This is not exact because the order of these rankings is not entirely
random. After all, we seek to rank a respondent as highly as possible
for the respondent’s own test set.


Figure 7: Respondent re-identification accuracy using
lower-resolution images. Note that the 1200, 600, 300,
and 150 DPI lines almost entirely overlap.


3.3 Additional Experiments


To study the information conveyed by bubble markings 
and support our results, we performed seven comple- 
mentary experiments. In the first, we evaluate the ef- 
fect that scanner resolution has on re-identification ac- 
curacy. Next, we considered our ability to re-identify a 
respondent from a single test mark given a training set 
containing a single training mark from each respondent. 
Because bubble forms typically contain multiple mark- 
ings, this experiment is somewhat artificial, but it hints 
at the information available from a single dot. The third 
and fourth supplemental experiments explored the ben- 
efits of increasing the training and test set sizes respec- 
tively while holding the other set to a single bubble. In 
the fifth test, we examined the tradeoff between training 
and test set sizes. The final two experiments validated 
our results using additional gray bubbles from the sam- 
ple surveys and demonstrated the benefits of our feature 
set over PCA alone. As with the primary experiments, 
we repeated each experiment ten times. 


Effect of resolution on accuracy. In practice, high- 
resolution scans of bubble forms may not be available, 
but access to lower resolution scans may be feasible. To 
determine the impact of resolution on re-identification 
accuracy, we down-sampled each ballot from the orig- 
inal 1200 DPI to 600, 300, 150, and 48 DPI. We then 
repeated the re-identification experiment of Section 3.1 
on bubbles at each resolution. 
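
One simple way to down-sample is block averaging, sketched below in Python. This is an assumed method for illustration only; the text does not state which resampling filter was used. With this scheme, going from 1200 DPI to 600, 300, 150, or 48 DPI corresponds to factors of 2, 4, 8, and 25.

```python
def downsample(image, factor):
    """Shrink a grayscale image (list of rows) by averaging factor x factor blocks.

    Rows and columns that do not fill a whole block are dropped.
    """
    height = len(image) // factor
    width = len(image[0]) // factor
    result = []
    for by in range(height):
        row = []
        for bx in range(width):
            block = [image[by * factor + dy][bx * factor + dx]
                     for dy in range(factor) for dx in range(factor)]
            row.append(sum(block) / len(block))
        result.append(row)
    return result
```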

Figure 7 shows that decreasing the image resolution
has little impact on performance for resolutions above
150 DPI. At 150 DPI, the accuracy of our algorithm’s
first guess decreases to 45.1% from the 51.1% accuracy
observed at 1200 DPI. Accuracy remains relatively
strong even at 48 DPI, with the first guess correct 36.4%
of the time and the correct respondent falling in the top
ten guesses 86.8% of the time. While down-sampling
may not perfectly replicate scanning at a lower
resolution, these results suggest that strong accuracy
remains feasible even at resolutions for which printed
text is difficult to read.


Figure 8: One marked bubble per respondent in each of
the training and test sets. The expected value from
random guessing is provided as reference.


Single bubble re-identification. This experiment 
measured the ability to re-identify an individual using a 
single marked bubble in the test set and a single example 
per respondent in the training set. This is a worst-case 
scenario, as bubble forms typically contain multiple 
markings. We extracted two bubbles from each survey 
and trained a model using the first bubble. We then 
applied the trained model to each of the 92 second 
bubbles and determined whether the predicted identity 
was correct. Under these constrained circumstances, an 
accuracy rate above that of random guessing (approxi- 
mately 1%) would suggest that marked bubbles embed 
distinguishing features. 

On average, our algorithm’s first guess identified the 
correct respondent with 5.3% accuracy, five times better 
than the expected value for random guessing. See Figure 8, 
which shows the percentage of test bubbles for 
which the correct respondent fell at or above each possible 
rank. The correct respondent was in the top ten 
guesses 31.4% of the time. This result suggests that individuals 
can inadvertently convey information about their 
identities from even a single completed bubble. 

5Note: In this experiment, we removed the restriction that the set 
of images used to generate eigenvectors for PCA contains an example 
from each column. 

Figure 9: Increasing the training set size from 1 to 19 
dots per respondent. 


Increasing training set size. In practice, respondents 
rarely fill out a single bubble on a form, and no two 
marked bubbles will be exactly the same. By training 
on multiple bubbles, we can isolate patterns that are con- 
sistent and distinguishing for a respondent from ones that 
are largely random. This experiment sought to verify this 
intuition by confirming that an increase in the number of 
training samples per respondent increases accuracy. We 
held our test set at a single bubble for each respondent 
and varied the training set size from 1 to 19 bubbles per 
respondent (recall that we have twenty total bubbles per 
respondent). 

Figure 9 shows the impact various training set sizes 
had on whether the correct respondent was the top guess 
or fell in the top 3, 5, or 10 guesses. Given nineteen train- 
ing dots and a single test dot, our first guess was correct 
21.8% of the time. The graph demonstrates that a greater 
number of training examples tends to result in more ac- 
curate predictions, even with a single-dot test set. For the 
nineteen training dots case, the correct respondent was 
in the top 3 guesses 40.8% of the time and the top 10 
guesses 64.5% of the time. 
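The top-guess and top-3, top-5, and top-10 figures reported in these experiments are all instances of top-k accuracy. A minimal sketch of that metric follows, together with the expected value for random guessing. The function names and the list-based input format are assumptions for illustration and are not taken from the original implementation.

```python
def top_k_accuracy(ranked_guesses, true_ids, k):
    """Fraction of test cases whose true respondent appears among the
    first k entries of that case's ranked guess list."""
    if len(ranked_guesses) != len(true_ids):
        raise ValueError("need one ranked list per test case")
    hits = sum(
        1
        for guesses, truth in zip(ranked_guesses, true_ids)
        if truth in guesses[:k]
    )
    return hits / len(true_ids)


def random_top_k_baseline(num_respondents, k):
    """Expected top-k accuracy when candidates are ranked uniformly at random."""
    return min(k, num_respondents) / num_respondents
```

With 92 respondents, `random_top_k_baseline(92, 1)` is 1/92, which matches the approximately 1% baseline quoted for the single-bubble experiment.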


Increasing test set size. This experiment is similar to 
the previous experiment, but we instead held the training 
set at a single bubble per respondent and varied the test 
set size from 1 to 19 bubbles per respondent. Intuitively, 
increasing the number of examples per respondent in the 
test set helps ensure that our algorithms guess based on 
consistent features—even if the training set is a single 
noisy bubble. 

Figure 10: Increasing the test set size from 1 to 19 dots 
per respondent. 

Figure 10 shows the impact of various test set sizes 
on whether the correct respondent was the top guess or 
fell in the top 3, 5, or 10 guesses. We see more gradual 
improvements when increasing the test set size than 
observed when increasing training set size in the previ- 
ous test. From one to nineteen test bubbles per respon- 
dent, the accuracy of our top 3 and 5 guesses increases 
relatively linearly with test set size, yielding maximum 
improvements of 4.3% and 7.6% respectively. For the 
top-guess case, accuracy increases with test set size from 
5.3% at one bubble per respondent to 8.1% at eight bub- 
bles then roughly plateaus. Similarly, the top 10 guesses 
case plateaus near ten bubbles and has a maximum im- 
provement of 8.0%. Starting from equivalent sizes, the 
marginal returns from increasing the training set size 
generally exceed those seen as test set size increases. 
Next, we explore the tradeoff between both set sizes 
given a fixed total of twenty bubbles per respondent. 


Training-test set size tradeoff. Because we have a 
constraint of twenty bubbles per sample respondent, the 
combined total size of our training and test sets per re- 
spondent is limited to twenty. This experiment examined 
the tradeoff between the sizes of these sets. For each 
value of x from 1 to 19, we set the size of the training 
set per respondent to x and the test set size to 20 − x. In 
some scenarios, a person analyzing bubbles would have 
far larger training and test sets than in this experiment. 
Fortunately, having more bubbles would not harm per- 
formance: an analyst could always choose a subsample 
of the bubbles if it did. Therefore, our results provide a 
lower bound for these scenarios. 
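The sweep described above can be sketched as follows. The list `splits` enumerates every (training, test) size pair for a budget of twenty bubbles, and `split_bubbles` partitions one respondent's bubbles accordingly. Random assignment with a fixed seed is an assumption here, as are both names; the text does not state how bubbles were assigned to each set.

```python
import random


def split_bubbles(bubbles, train_size, seed=0):
    """Randomly partition one respondent's bubbles into training and test sets.

    Random assignment is an illustrative assumption, not the paper's procedure.
    """
    if not 0 < train_size < len(bubbles):
        raise ValueError("train_size must leave at least one bubble per set")
    shuffled = list(bubbles)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:train_size], shuffled[train_size:]


TOTAL = 20  # bubbles available per respondent
# Every (training size x, test size 20 - x) pair for x from 1 to 19.
splits = [(x, TOTAL - x) for x in range(1, TOTAL)]
```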

Figure 11 shows how varying training/test set sizes affected 
whether the correct respondent was the top guess 
or fell in the top 3, 5, or 10 guesses. As the graph demonstrates, 
the optimal tradeoff was achieved with roughly 
twelve bubbles per respondent in the training set and 
eight bubbles per respondent in the test set. 

Figure 11: Trade-off between training and test set sizes. 

Figure 12: This respondent tends to have a circular pattern 
with a flourish stroke at the end. The gray background 
makes the flourish stroke harder to detect. 


Validation with gray bubbles. To further validate our 
methods, we tested the accuracy of our algorithms with a 
set of bubbles that we previously excluded: bubbles with 
gray backgrounds. These bubbles pose a significant chal- 
lenge as the paper has both a grayish hue and a regular 
pattern of darker spots. This not only makes it harder to 
distinguish between gray pencil lead and the paper back- 
ground but also limits differences in color distribution 
between users. See Figure 12. 

As before, we selected surveys by locating ones with 
five completed (gray) bubbles for each answer choice, 
1-4, yielding 97 surveys. We use twelve bubbles per re- 
spondent in the training set and eight bubbles in the test 
set, and we apply the same algorithms and parameters for 
this test as the test in Section 3.1 on a white background. 

Figure 13 shows the percentage of test cases for which 
the correct respondent fell at or above each possible 
rank. Our first guess is correct 42.3% of the time, with 
the correct respondent falling in the top 3, 5, and 10 
guesses 62.1%, 75.8%, and 90.0% of the time respectively. 
While slightly weaker than the results on a white 
background for reasons specified above, this experiment 
suggests that our strong results are not simply a byproduct 
of our initial dataset. 

Figure 13: Using the unmodified algorithm with the 
same configuration as in Figure 5 on dots with gray backgrounds, 
we see only a mild decrease in accuracy. 

Figure 14: Performance with various combinations of 
features. 


Feature vector options. As discussed in Section 2.1, 
our feature vectors combine PCA data, shape descrip- 
tors, and a custom color distribution to compensate for 
the limited data available from bubble markings. We 
tested the performance of our algorithms for equivalent 
parameters with PCA alone and with all three features 
combined. This test ran under the same setup as Figure 5 
in Section 3.1. 

For both PCA and the full feature set, Figure 14 shows 
the percentage of test cases for which the correct respondent 
fell at or above each possible rank. The additional 
features improve the accuracy of our algorithm’s first 
guess from 39.0% to 51.1% and the accuracy of the top 
ten guesses from 87.2% to 92.4%. 

Figure 15: Bubbles from respondents often mistaken for 
each other: (a) Person A, (b) Person B. Both respondents 
use superficially similar techniques, leaving unmarked 
space in similar locations. 


3.4 Discussion 


Although our accuracy exceeds 50% for respondent re- 
identification, the restrictive nature of marking a bubble 
limits the distinguishability between users. We briefly 
consider a challenging case here. 

Figure 15 shows marked bubbles from two respon- 
dents that our algorithm often mistakes for one another. 
Both individuals appear to use similar techniques to com- 
plete a bubble: a circular motion that seldom deviates 
from the circle boundary, leaving white-space both in the 
center and at similar locations near the border. Unless the 
minor differences between these bubbles are consistently 
demonstrated by the corresponding respondents, differ- 
entiating between these cases could prove quite difficult. 
The task of completing a bubble is constrained enough 
that close cases are nearly inevitable. In spite of these 
challenges, however, re-identification and detection of 
unauthorized respondents are feasible in practice. 


4 Impact 


This work has both positive and negative implications de- 
pending on the context and application. While we limit 
our discussion to standardized tests, elections, surveys, 
and authentication, the ability to derive distinctive bub- 
ble completion patterns for individuals may have conse- 
quences beyond those examined here. In Section 7, we 
discuss additional tests that would allow us to better as- 
sess the impact in several of these scenarios. In particu- 
lar, most of these cases assume that an individual’s dis- 
tinguishing features remain relatively stable over time, 
and tests on longitudinal data are necessary to evaluate 
this assumption. 
4.1 Standardized Tests 


Scores on standardized tests may affect academic 
progress, job prospects, educator advancement, and 
school funding, among other possibilities. These high 
stakes provide an incentive for numerous parties to cheat 
and for numerous other parties to ensure the validity of 
the results. In certain cheating scenarios, another party 
answers questions on behalf of an official test-taker. For 
example, a surrogate could perform the entire test, or a 
proctor could change answer sheets after the test [11, 17]. 
The ability to detect when someone other than the autho- 
rized test-taker completes some or all of the answers on 
a bubble form could help deter this form of cheating. 

Depending on the specific threat and available data, 
several uses of our techniques exist. Given past answer 
sheets, test registration forms, or other bubbles ostensi- 
bly from the same test-takers, we could train a model as 
in Section 3.2 and use it to infer whether a surrogate completed 
some or all of a test.6 Although the surrogate may 
not be in the training set, we may rely on the fact that the 
surrogate is less likely to have bubble patterns similar to 
the authorized test-taker than to another set of test-takers. 
Because our techniques are automated, they could flag 
the most anomalous cases—i.e., the cases that would be 
rejected even under the least restrictive thresholds—in 
large-scale datasets for manual review. 

If concern exists that someone changed certain an- 
swers after the test (for example, a proctor corrected 
the first five answers for all tests), we could search for 
questions that are correctly answered at an unusually high 
rate. Given this information, two possible analysis techniques 
exist. First, we could train on the less suspicious 
questions and use the techniques of Section 3.2 to deter- 
mine whether the suspicious ones on a form are from the 
same test-taker. Alternatively, we could train on the non- 
suspicious answer choices from each form and the sus- 
picious answer choices from all forms other than a form 
of interest. Given this model, we could apply the tech- 
niques of Section 3.1 to see whether suspicious bubbles 
on that form more closely match less-suspicious bubbles 
on the same form or suspicious bubbles on other forms. 
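One way to search for questions answered correctly at an unusually high rate is sketched below. The z-score rule, its threshold of 2.0, and the function names are illustrative assumptions; the text does not prescribe a particular outlier test.

```python
from statistics import mean, pstdev


def correct_rates(answer_sheets, answer_key):
    """Per-question fraction of sheets whose answer matches the key."""
    n = len(answer_sheets)
    return [
        sum(1 for sheet in answer_sheets if sheet[q] == key) / n
        for q, key in enumerate(answer_key)
    ]


def flag_suspicious(rates, z_threshold=2.0):
    """Indices of questions whose correct rate sits far above the mean.

    The z-score cutoff is an assumed heuristic, not a value from the paper.
    """
    mu = mean(rates)
    sigma = pstdev(rates)
    if sigma == 0:
        return []  # all questions behave identically; nothing stands out
    return [q for q, r in enumerate(rates) if (r - mu) / sigma > z_threshold]
```

The flagged indices would then define the suspicious set, and the remaining questions the less suspicious set, for either of the two analyses described above.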


4.2 Elections 


Our techniques provide a powerful tool for detecting cer- 
tain forms of election fraud but also pose a threat to voter 
privacy. 

6Note our assumption that the same unauthorized individual has not 
completed both the training bubbles and the current answer sheet. 

Suppose that concerns exist that certain paper ballots 
were fraudulently submitted by someone other than a 
valid voter. Although the identity of voters might not 
be known for direct comparison to past ballots or other 
forms, we can compare batches of ballots in one election 
to batches from past elections. Accounting for demo- 
graphic changes and the fact that voters need not vote 
in all elections, the ballots should be somewhat similar 
across elections. For example, if 85% of the voters on 
the previous election’s sign-in list also voted during this 
election, we would expect similarities between 85% of 
the old ballots and the new ballots. 

To test this, we may train on ballots from the previ- 
ous election cycle and attempt to re-identify new ballots 
against the old set. We would not expect ballots to per- 
fectly match. Nevertheless, if less than approximately 
85% of the old “identities” are covered by the new ballots 
or many of the new ballots cluster around a small num- 
ber of identities, this would raise suspicion that someone 
else completed these forms, particularly if the forms are 
unusually biased towards certain candidates or issues. 
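The two warning signs described here, low coverage of old identities and heavy clustering of new ballots, can be summarized with a short sketch. It assumes each new ballot has already been assigned a single top-guess identity from the model trained on old ballots. The function names and the `max_share` cutoff are hypothetical, and the 0.85 default simply mirrors the running example.

```python
from collections import Counter


def coverage_and_clustering(predicted_ids, num_old_identities):
    """Summarize how new ballots map onto identities learned from old ballots.

    predicted_ids: the top-guess old identity for each new ballot.
    Returns (fraction of old identities matched by at least one new ballot,
             largest share of new ballots assigned to a single identity).
    """
    counts = Counter(predicted_ids)
    coverage = len(counts) / num_old_identities
    largest_share = (
        max(counts.values()) / len(predicted_ids) if predicted_ids else 0.0
    )
    return coverage, largest_share


def looks_suspicious(coverage, largest_share,
                     expected_coverage=0.85, max_share=0.05):
    """Flag a batch for manual review; both cutoffs are assumed values."""
    return coverage < expected_coverage or largest_share > max_share
```

Because re-identification is imperfect, any real threshold would need to be calibrated against the error rate observed on ballots known to be legitimate.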

Similarly, analysis could also help uncover fraudulent 
absentee ballots. Because absentee ballots do not require 
a voter to be physically present, concerns exist about 
individuals fraudulently obtaining and submitting these 
ballots [19]. By training a model on past ballots, we 
could assess whether suspicious absentee ballots fail to 
match the diversity expected from the population completing 
these forms.7 

Unfortunately, because bubble markings can serve as 
a biometric, they can also be used in combination with 
seemingly innocuous auxiliary data to undermine ballot 
secrecy. Some jurisdictions now release scanned images 
of ballots following elections with the goal of increasing 
transparency (e.g., Humboldt County, California [14], 
which releases ballot scans at 300 DPI). If someone has 
access to these images or otherwise has the ability to ob- 
tain ballot scans, they can attempt to undermine voter pri- 
vacy. Although elections may be decided by millions of 
voters, an attacker could focus exclusively on ballots cast 
in a target’s precinct. New Jersey readjusts larger elec- 
tion districts to contain fewer than 750 registered voters 
[22]. Assuming 50% turnout, ballots in these districts 
would fall in groups of 375 or smaller. In Wisconsin’s 
contentious 2011 State Supreme Court race, 71% of re- 
ported votes cast fell in the 91% of Wisconsin wards with 
1,000 or fewer total votes [28]. 

7If a state uses bubble form absentee ballot applications, analysis 
could even occur on the applications themselves. 

Suppose that an interested party, such as a potential 
employer, wishes to determine how you voted. Given 
the ability to obtain bubble markings known to be from 
you (for example, on an employment application), that 
party can replicate our experiment in Section 3.1 to isolate 
one or a small subset of potential corresponding ballots. 
What makes this breach of privacy troubling is that 
it occurs without the consent of the voter and requires 
no special access to the ballots (unlike paper fingerprinting 
techniques [4], which require access to the physical 
ballot). The voter has not attempted to make an iden- 
tifying mark, but the act of voting results in identifying 
marks nonetheless. This threat exists not only in tradi- 
tional government elections but also in union and other 
elections. 


Finally, one known threat against voting systems is 
pattern voting. For this threat, an attacker coerces a voter 
to select a preferred option in a relevant race and an un- 
usual combination of choices for the other races. The un- 
usual voting pattern will allow the attacker to locate the 
ballot later and confirm that the voter selected the cor- 
rect choice for the relevant race. One proposed solution 
for pattern voting is to cut ballots apart to separate votes 
in individual contests [6]. Our work raises the possibil- 
ity that physically divided portions of a ballot could be 
connected, undermining this mitigation strategy.8 


4.3 Surveys 


Human subjects research is governed by a variety of re- 
strictions and best practices intended to balance research 
interests against the subjects’ interests. One factor to be 
considered when collecting certain forms of data is the 
level of anonymity afforded to subjects. If a dataset con- 
tains identifying information, such as subject name, this 
may impact the types of data that should be collected and 
procedural safeguards imposed to protect subject privacy. 
If subjects provide data using bubble forms, these mark- 
ings effectively serve as a form of identifying informa- 
tion, tying the form to the subject even in the absence of 
a name. Re-identification of subjects can proceed in the 
same manner as re-identification of voters, by matching 
marks from a known individual against completed sur- 
veys (as in Section 3.1). 


Regardless of whether ethical or legal questions are 
raised by the ability to identify survey respondents, this 
ability might affect the honesty of respondents who are 
aware of the issue. Dishonesty poses a problem even 
for commercial surveys that do not adhere to the typical 
practices of human subjects research. 


The impact of this work for surveys is not entirely neg- 
ative, however. In certain scenarios, the person respon- 
sible for administering a survey may complete the forms 
herself or modify completed forms, whether to avoid the 
work of conducting the survey or to yield a desired out- 
come. Should this risk exist, similar analysis to the stan- 
dardized test and election cases could help uncover the 
issue. 


8We thank an anonymous reviewer for suggesting this possibility. 
4.4 Authentication 


Because bubble markings are a biometric, they may be 
used alone or in combination with other techniques for 
authentication. Using a finger or a stylus, an individual 
could fill in a bubble on a pad or a touchscreen. Be- 
cause a computer could monitor user input, various de- 
tails such as velocity and pressure could also be collected 
and used to increase the accuracy of identification, poten- 
tially achieving far stronger results than in Section 3.2. 
On touchscreen devices, this technique may or may not 
be easier for users than entry of numeric codes or pass- 
words. Additional testing would be necessary for this ap- 
plication, including tests of its performance in the pres- 
ence of persistent adversaries. 


5 Mitigation 


The impact of this paper’s techniques can be both bene- 
ficial and detrimental, but the drawbacks may outweigh 
the benefits under certain circumstances. In these cases, a 
mitigation strategy is desirable, but the appropriate strat- 
egy varies. We discuss three classes of mitigation strate- 
gies. First, we consider changes to the forms themselves 
or how individuals mark the forms. Second, we exam- 
ine procedural safeguards that restrict access to forms or 
scanned images. Finally, we explore techniques that ob- 
scure or remove identifying characteristics from scanned 
images. No strategy alone is perfect, but various combi- 
nations may be acceptable under different circumstances. 


5.1 Changes to Forms or Marking Devices 


As explored in Section 3.3, changes to the forms them- 
selves such as a gray background can impact the accu- 
racy of our tests. The unintentional protection provided 
by this particular change was mild and unlikely to be 
insurmountable. Nevertheless, more dramatic changes 
to either the forms themselves or the ways people mark 
them could provide a greater measure of defense. 

Changes to the forms themselves should strive to limit 
either the space for observable human variation or the 
ability of an analyst to perceive these variations. The ad- 
dition of a random speckled or striped background in the 
same color as the writing instrument could create diffi- 
culties in cleanly identifying and matching a mark. If 
bubbles had wider borders, respondents would be less 
likely to color outside the lines, decreasing this source 
of information. Bubbles of different shapes or alternate 
marking techniques could encourage less variation be- 
tween users. For example, some optical scan ballots re- 
quire a voter simply to draw a line to complete an arrow 
shape [1], and these lines may provide less identifying 
information than a completed bubble. 
The marking instruments that individuals use could 
also help leak less identifying information. Some Los 
Angeles County voters use ink-marking devices, which 
stamp a circle of ink for a user [18]. Use of an ink- 
stamper would reduce the distinguishability of markings, 
and even a wide marker could reduce the space for inad- 
vertent variation. 


5.2 Procedural Safeguards 


Procedural safeguards that restrict access to both forms 
themselves and scanned images can be both straightfor- 
ward and effective. Collection of data from bubble forms 
typically relies on scanning the forms, but a scanner need 
not retain image data for any longer than required to pro- 
cess a respondent’s choices. If the form and its image 
are unavailable to an adversary, our techniques would be 
infeasible. 

In some cases, instructive practices or alternative tech- 
niques already exist. For example, researchers conduct- 
ing surveys could treat forms with bubble markings in the 
same manner as they would treat other forms containing 
identifying information. In the context of elections, some 
jurisdictions currently release scanned ballot images following 
an election to provide a measure of transparency. 
This release is not a satisfactory replacement for a sta- 
tistically significant manual audit of paper ballots (e.g., 
[2, 5]), however, and it is not necessary for such an au- 
dit. Because scanned images could be manipulated or 
replaced, statistically significant manual confirmation of 
the reported ballots’ validity remains necessary. Further- 
more, releasing the recorded choices from a ballot (e.g., 
Washington selected for President, Lincoln selected for 
Senator, etc.) without a scanned ballot image is sufficient 
for a manual audit. 

Whether the perceived transparency provided by the 
release of ballot scans justifies the resulting privacy risk 
is outside the scope of this paper. Nevertheless, should 
the release of scanned images be desirable, the next sec- 
tion describes methods that strive to protect privacy in 
the event of a release. 


5.3 Scrubbing Scanned Images 


In some cases, the release of scanned bubble forms 
themselves might be desirable. In California, Humboldt 
County’s release of ballot image scans following the 
2008 election uncovered evidence of a software glitch 
causing certain ballots to be ignored [13]. Although a 
manual audit could have caught this error with high prob- 
ability, ballot images provide some protection against un- 
intentional errors in the absence of such audits. 

The ability to remove identifying information from 
scanned forms while retaining some evidence of a respondent’s 
actions is desirable. One straightforward approach 
is to cover the respondent’s recorded choices with 
solid black circles. Barring any stray marks or mis- 
readings, this choice would completely remove all iden- 
tifying bubble patterns. Unfortunately, this approach 
has several disadvantages. First, a circle could cover 
choices that were not selected, hiding certain forms of 
errors. Second, suppose that a bubble is marked but not 
recorded. While the resulting image would allow review- 
ers to uncover the error, such marks retain a respondent’s 
identifying details. The threat of a misreading and re- 
identification could be sufficient to undermine respon- 
dent confidence, enabling coercion. 

An alternative to the use of black circles is to re- 
place the contents of each bubble with its average color, 
whether the respondent is or is not believed to have se- 
lected the bubble. The rest of the scan could be scrubbed 
of stray marks. This would reduce the space for variation 
to color and pressure properties alone. Unfortunately, no 
evidence exists that these properties cannot still be dis- 
tinguishing. In addition, an average might remove a re- 
spondent’s intent, even when that intent may have been 
clear to the scanner interpreting the form. Similar mit- 
igation techniques involve blurring the image, reducing 
the image resolution, or making the image strictly black 
and white, all of which have similar disadvantages to av- 
eraging colors. 
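A minimal sketch of the average-color replacement follows, assuming a grayscale image stored as a list of rows and a bubble described by a known center and radius. The `scrub_bubble` name and the circular-mask geometry are assumptions for illustration; locating bubbles on a real form is a separate step not shown here.

```python
def scrub_bubble(image, center, radius):
    """Return a copy of a grayscale image (list of rows) in which every pixel
    inside the circular bubble is replaced by the bubble's mean intensity."""
    cy, cx = center
    inside = [
        (r, c)
        for r in range(len(image))
        for c in range(len(image[r]))
        if (r - cy) ** 2 + (c - cx) ** 2 <= radius ** 2
    ]
    scrubbed = [row[:] for row in image]  # leave the input untouched
    if not inside:
        return scrubbed
    avg = sum(image[r][c] for r, c in inside) / len(inside)
    for r, c in inside:
        scrubbed[r][c] = avg
    return scrubbed
```

After scrubbing, the only per-bubble signal left is a single intensity value, which is the residual color and pressure information the paragraph above warns may still be distinguishing.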


One interesting approach comes from the facial image 
recognition community. Newton et al. [23] describe a 
method for generating k-anonymous facial images. This 
technique replaces each face with a “distorted” image 
that is k-anonymous with respect to faces in the input 
set. The resulting k-anonymous image maintains the ex- 
pected visual appearance, that of a human face. The ex- 
act details are beyond the scope of this paper, but the 
underlying technique reduces the dimensionality using 
Principal Component Analysis and an algorithm for re- 
moving the most distinctive features of each face [23]. 

Application of facial anonymization to bubbles is 
straightforward. Taking marked and unmarked bubbles 
from all ballots in a set, we can apply the techniques 
of Newton et al. to each bubble, replacing it with its k- 
anonymous counterpart. The result would roughly main- 
tain the visual appearance of each bubble while removing 
certain unique attributes. Unfortunately, this approach 
is imperfect in this scenario. Replacement of an image 
might hide a respondent’s otherwise obvious intent. In 
addition, distinguishing trends might occur over multi- 
ple bubbles on a form: for example, an individual might 
mark bubbles differently near the end of forms (this is 
also a problem for averaging the bubble colors). Fi- 
nally, concerns exist over the guarantees provided by k- 
anonymity [3], but the work may be extensible to achieve 
other notions of privacy, such as differential privacy [9]. 
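To make the idea concrete, the sketch below groups bubble feature vectors greedily and replaces each vector with its group mean, so that every output is shared by at least k inputs whenever at least k inputs exist. This is a simplified illustration loosely inspired by the averaging idea, not the algorithm of Newton et al. as published. The `k_same` name, the Euclidean distance, and the plain-list vectors are all assumptions.

```python
import math


def k_same(vectors, k):
    """Replace each vector with the mean of a group of at least k inputs.

    Simplified greedy grouping for illustration only. If fewer than k
    vectors are supplied, they all fall into one smaller group.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    remaining = list(range(len(vectors)))
    result = [None] * len(vectors)
    while remaining:
        # Absorb leftovers so that no later group ends up smaller than k.
        size = len(remaining) if len(remaining) < 2 * k else k
        seed = remaining[0]
        group = sorted(
            remaining, key=lambda i: math.dist(vectors[seed], vectors[i])
        )[:size]
        dims = len(vectors[seed])
        mean = [sum(vectors[i][d] for i in group) / len(group)
                for d in range(dims)]
        for i in group:
            result[i] = mean
        remaining = [i for i in remaining if i not in group]
    return result
```

In practice the vectors would be a reduced-dimension representation of each bubble image, and each averaged vector would be projected back to an image before release.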
We caution that the value of these images for proving 
the true contents of physical bubble forms is limited: an 
adversary with access to the images, whether scrubbed 
or not, could intentionally modify them to match a de- 
sired result. These approaches are most useful where the 
primary concern is unintentional error. 


6 Related Work 


Biometrics. Biometrics can be based on physical or 
behavioral characteristics of an individual. Physical bio- 
metrics are based on physical characteristics of a person, 
such as fingerprints, facial features, and iris patterns. Be- 
havioral biometrics are based on behavioral characteris- 
tics that tend to be stable and difficult to replicate, such 
as speech or handwriting/signature [15]. Bubble comple- 
tion patterns are a form of behavioral biometric. 

As a biometric, bubble completion patterns are simi- 
lar to handwriting, though handwriting tends to rely on 
a richer, less constrained set of available features. In either 
case, analysis may occur on-line or off-line [21]. 
In an on-line process, the verifying party may monitor 
characteristics like stroke speed and pressure. In an off- 
line process, a verifying party only receives the resulting 
data, such as a completed bubble. Handwriting-based 
recognition sometimes occurs in an on-line setting. Be- 
cause off-line recognition is more generally applicable, 
our analysis occurred purely in an off-line manner. In 
some settings, such as authentication, on-line recogni- 
tion would be possible and could yield stronger results. 


Document re-identification. Some work seeks to re- 
identify a precise physical document for forgery and 
counterfeiting detection (e.g., [7]). While the presence 
of biometrics may assist in re-identification, the prob- 
lems discussed in this paper differ. We seek to discover 
whether sets of marked bubbles were produced by the 
same individual. Our work is agnostic towards whether 
the sets come from the same form, different forms, or du- 
plicates of forms. Nevertheless, our work and document 
re-identification provide complementary techniques. For 
example, document re-identification could help deter- 
mine whether the bubble form (ballot, answer sheet, sur- 
vey, etc.) provided to an individual matches the one 
returned or detect the presence of fraudulently added 
forms. 


Cheating Detection. Existing work uses patterns in 
answer choices themselves as evidence of cheating. 
Myagkov et al. [20] uncover indicators of election fraud 
using aggregate vote tallies, turnout, and historical data. 
Similarly, analysis of answers on standardized tests can 
be particularly useful in uncovering cheating [10, 17]. 
For example, if students in a class demonstrate mediocre 
overall performance on a test yet all correctly answer a 
series of difficult questions, this may raise concerns of 
cheating. The general strategy in this line of research is 
to look for answers that are suspicious in the context of 
either other answers or auxiliary data. 

Bubble-based analysis is also complementary to these 
anti-cheating measures. Each technique effectively iso- 
lates sets of suspicious forms, and the combination of 
the two would likely be more accurate than each inde- 
pendently. Although our techniques alone do not exploit 
contextual data, they have the advantage of being un- 
biased by that data. If a student dramatically improves 
her study habits, the resulting improvement in test scores 
alone might be flagged by other anti-cheating measures 
but not our techniques. 


7 Future Work 


Although a variety of avenues for future work exist, we 
focus primarily on possibilities for additional testing and 
application-specific uses here. 

Our sample surveys allowed a diverse set of tests, but 
access to different datasets would enable additional use- 
ful tests. We are particularly interested in obtaining and 
using longitudinal studies—in which a common set of re- 
spondents fill out bubble forms multiple times over some 
period—to evaluate our methods. While providing an 
increased number of examples, this could also identify 
how a respondent’s markings vary over time, establish 
consistency over longer durations, and confirm that our 
results are not significantly impacted by writing utensil. 
Because bubble forms from longitudinal studies are not 
widely available, this might entail collecting the data our- 
selves. 

While we tested our techniques using circular bubbles 
with numbers inscribed, numerous other form styles ex- 
ist. In some cases, respondents instead fill in ovals or 
rectangles. In other cases, selection differs dramatically 
from the traditional fill-in-the-shape approach—for ex- 
ample, the line-drawing approach discussed in Section 5 
bears little similarity to our sample forms. Testing these 
cases would not only explore the limits of our work but 
could also help uncover mitigation strategies. 

Section 4 discusses a number of applications of our 
techniques. Adapting the techniques to work in these 
scenarios is not always trivial. For example, Section 6 
discusses existing anti-cheating techniques for standard- 
ized tests. Combining the evidence provided by existing 
techniques and ours would strengthen anti-cheating mea- 
sures, but it would also require some care to process the 
data quickly and merge results. 

Use of bubble markings for authentication would re- 
quire both additional testing and additional refinement 
of our techniques. Given a dataset containing on-line
information, such as writing instrument position, veloc- 
ity, and pressure, we could add this data to our fea- 
ture vectors and test the accuracy of our techniques with 
these new features. This additional information could 
increase identifiability considerably—signature verifica- 
tion is commonly done on-line due to the utility of this 
data—and may yield an effective authentication system. 
Depending on the application, a bubble-based authenti- 
cation system would potentially need to work with a fin- 
ger rather than a pen or stylus. Because the task of fill- 
ing in a bubble is relatively constrained, this application 
would require cautious testing to ensure that an adversary 
cannot impersonate a legitimate user. 


8 Conclusion 


Marking a bubble is an extremely narrow task, but as 
this work illustrates, the task provides sufficient expres- 
sive power for individuals to unintentionally distinguish 
themselves. Using a dataset with 92 individuals, we 
demonstrate how to re-identify a respondent’s survey 
with over 50% accuracy. In addition, we are able to 
detect an unauthorized respondent with over 92% accu- 
racy with a false positive rate below 10%. We achieve 
these results while performing off-line analysis exclu- 
sively, but on-line analysis has the potential to achieve 
even higher rates of accuracy. 

The implications of this study extend to any system 
utilizing bubble forms to obtain user input, especially 
cases for which protection or confirmation of a respon- 
dent’s identity is important. Additional tests can better 
establish the threat (or benefit) posed in real-world sce- 
narios. Mitigating the amount of information conveyed 
through marked bubbles is an open problem, and solutions
are dependent on the application. For privacy-critical
applications, such as the publication of ballots, we
suggest that groups releasing data consider means of 
masking respondents’ markings prior to publication. 

Measuring and Analyzing Search-Redirection Attacks
in the Illicit Online Prescription Drug Trade 
Carnegie Mellon University


Abstract 


We investigate the manipulation of web search re- 
sults to promote the unauthorized sale of prescription 
drugs. We focus on search-redirection attacks, where 
miscreants compromise high-ranking websites and dy- 
namically redirect traffic to different pharmacies based 
upon the particular search terms issued by the consumer. 
We constructed a representative list of 218 drug-related 
queries and automatically gathered the search results 
on a daily basis over nine months in 2010-2011. We 
find that about one third of all search results are one 
of over 7000 infected hosts triggered to redirect to a 
few hundred pharmacy websites. Legitimate pharmacies 
and health resources have been largely crowded out by 
search-redirection attacks and blog spam. Infections per- 
sist longest on websites with high PageRank and from 
.edu domains. 96% of infected domains are connected 
through traffic redirection chains, and network analysis 
reveals that a few concentrated communities link many 
otherwise disparate pharmacies together. We calculate 
that the conversion rate of web searches into sales lies 
between 0.3% and 3%, and that more illegal drug sales
are facilitated by search-redirection attacks than by email 
spam. Finally, we observe that concentration in both the 
source infections and redirectors presents an opportunity 
for defenders to disrupt online pharmacy sales. 


1 Introduction and background 


Prescription drugs sold illicitly on the Internet arguably 
constitute the most dangerous online criminal activity. 
While the resale of counterfeit luxury goods or software is
an obvious fraud, counterfeit medicines actually endanger
public safety. Independent testing has indeed revealed 
that the drugs often include the active ingredient, but in 
incorrect and potentially dangerous dosages [48]. 

In the wake of the death of a teenager, the US Congress 
passed in 2008 the Ryan Haight Online Pharmacy Con- 
sumer Protection Act, rendering it illegal under federal 
law to “deliver, distribute, or dispense a controlled sub- 
stance by means of the Internet” without an authorized 
prescription, or “to aid and abet such activity” [35]. Yet, 
illicit sales have continued to thrive in the nearly two 
years since the law took effect. In response, the
White House has recently helped form a group of regis- 
trars, technology companies and payment processors to 
counter the proliferation of illicit online pharmacies [19]. 

Suspicious online retail operations have, for a long 
time, primarily resorted to email spam to advertise their 
Figure 1: Example of the search-redirection attack. Only
two of the results actually belong to online pharmacies. The rest are
unrelated .com or .edu sites that had been compromised to redirect to
online pharmacies, or have been populated with spam. The top search
result (framed) was still infected at the time of this writing.


products. However, the low conversion rates (realized 
sales over emails sent) associated with email spam [22] 
have led miscreants to adopt new tactics. Search-engine
manipulation [47], in particular, has become widely used 
to advertise products. The basic idea of search-engine 
manipulation is to inflate the position at which a specific 
retailer’s site appears in search results by artificially link- 
ing it from many websites. Conversion rates are believed 
to be much higher than for spam, since the advertised site 
has at least a degree of relevance to the query issued. 

In this paper, we focus on a particularly pernicious 
variant of search-engine manipulation involving compro- 
mised web servers, which we term search-redirection at- 
tacks. Analyzing measurements collected over a nine- 
month interval, we show that search-redirection attacks 
are fast becoming the search engine manipulation tech- 
nique of choice for online miscreants. 


1.1. Search-redirection attacks 


Figure 1 illustrates the attack. In response to the query
“cialis without prescription”, the top eight results include
five .edu sites, one .com site with a seemingly unrelated
domain name, and two online pharmacies. At first
glance, the .edu and one of the .com sites have
absolutely nothing to do with the sale of prescription drugs.
However, clicking on some of these links, including the 
top search result framed in Figure 1, takes the visitor not 
to the requested site, but to an online pharmacy store. 
The attack works as follows. The attacker first iden- 
tifies high-visibility websites that are also vulnerable to 
code injection attacks.¹ Popular targets include outdated
versions of WordPress [49], phpBB [38], or any other 
vulnerable blogging or wiki software. The code injected 
on the server intercepts all incoming HTTP requests to 
the compromised page and responds differently depend- 
ing on the type of request. 
Requests originating from search-engine crawlers, as 
identified by the User-Agent parameter of the HTTP re- 
quest, return a mix of the compromised site’s original 
content plus numerous links to websites promoted by the 
attacker (e.g., other compromised sites, online stores). 
This technique, “link stuffing,” has been observed for
several years [34] in non-compromised websites. 
Requests originating from pages of search results, 
for queries deemed relevant to what the attacker wants 
to promote, are redirected to a website of the at- 
tacker’s choosing. The compromised web server au- 
tomatically identifies these requests based on the Re- 
ferrer field that HTTP requests carry [14]. The Re- 
ferrer actually contains the complete resource identifier 
(URI) that triggered the request. For instance, in
Figure 1, when clicking on any of the links, the Referrer
field is set to
http://www.google.com/search?q=cialis+without+prescription.
Upon detecting the pharmacy-related query, the server sends an
HTTP redirect with status code 302 (Found) [14], along 
with a location field containing the desired pharmacy 
website or intermediary. The upshot is that the end user 
unknowingly visits a series of websites culminating in a 
fake pharmacy without ever spending time at the original 
site appearing in the search results. A similar technique 
has been extensively used to distribute malware [40], 
while web spammers have also used the technique to hide 
the true nature of their sites from investigators [33]. 
All other requests, including typing the URI directly 
into a browser, return the original content of the website. 
Therefore, website operators cannot readily discern that 
their website has been compromised. As we will show 
in Section 4, as a result of this “cloaking” mechanism, 
some of the victim sites remain infected for a long time. 
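
The three-way behavior described above can be modeled as a small decision function, which is useful for reasoning about what a measurement crawler must emulate. This is only a sketch of the logic as the text describes it, not code recovered from any compromised server: the marker lists `CRAWLER_MARKERS` and `DRUG_TERMS`, the function names, and the assumption that the query travels in a `q` parameter are all illustrative choices.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical marker lists, chosen only for illustration.
CRAWLER_MARKERS = ("googlebot", "bingbot")
DRUG_TERMS = ("cialis", "vicodin", "prescription")


def search_query(referrer: str) -> str:
    """Return the search query carried in a Referrer URI, or '' if absent.

    Assumes the query is passed in a 'q' parameter, as in the Figure 1 example.
    """
    if not referrer:
        return ""
    values = parse_qs(urlparse(referrer).query).get("q", [])
    return values[0].lower() if values else ""


def request_class(user_agent: str, referrer: str) -> str:
    """Model which of the three responses an infected page would return."""
    if any(marker in user_agent.lower() for marker in CRAWLER_MARKERS):
        return "link-stuffed page"      # request from a search-engine crawler
    query = search_query(referrer)
    if any(term in query for term in DRUG_TERMS):
        return "302 redirect"           # click on a relevant search result
    return "original content"           # everything else, e.g. a typed URI
```

The last branch is what makes the infection hard for a site operator to notice: a direct visit carries no search referrer and is therefore served the unmodified page.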
While each of the components (link stuffing, redirection
chains) of the search-redirection attack has been previously
observed, to our knowledge, no study has investigated
the combined attack itself, its effect on search results,
or the potential harm it inflicts.

¹ We defer the study of the specific exploits to future work. Our
focus in this paper is the outcome of the attack, not the attack itself.

Three classes of websites are involved in search- 
redirection attacks. Source infections are innocent web- 
sites that have been compromised and reprogrammed 
with the behavior just described; redirectors are inter- 
mediary websites that receive traffic from source infec- 
tions; and retailers (here, pharmacies) are destination 
websites that receive traffic from redirectors. 

It is not immediately obvious who the victim is 
in search-redirection attacks. Unlike in drive-by- 
downloads [40], end users issuing pharmacy searches 
are not necessarily victims, since they are actually often 
seeking to illegally procure drugs online. In fact, here, 
search engines do provide results relevant to what users 
are looking for, regardless of the legality of the products 
considered. However, users may also become victims 
if they receive inaccurately dosed medicine or danger- 
ous combinations that can cause physical harm or death. 
The operators of source infections are victims, but only 
marginally so, since they are not directly harmed by redi- 
recting traffic to pharmacies. Pharmaceutical companies 
are victims in that they may lose out on legitimate sales. 
The greatest harm is a societal one, because laws de- 
signed to protect consumers are being openly flouted. 


1.2. Summary of our contributions 


Our study contributes to the understanding of online 
crime and search engine manipulation in several ways. 

First, we collected search results over a nine-month 
interval (April 2010 to February 2011). The data comprise
daily returns from April 12, 2010 to October 21,
2010, complemented by an additional 10 weeks of data
from November 15, 2010 to February 1, 2011. Combining
both datasets, we gathered about 185000 different
uniform resource identifiers (URIs; pharmacies, benign
and compromised sites), of which around 63000 were
infected. We describe our measurement infrastructure
and methodology in detail in Section 2, and discuss the
search results in Section 3.

Second, we show that a quarter of the top 10 search 
results actively redirect from compromised websites to 
online pharmacies at any given time. We show infected 
websites are very slowly remedied: the median infection 
lasts 46 days, and 16% of all websites have remained in- 
fected throughout the study. Further, websites with high 
reputation (e.g., high PageRank) remain infected and ap- 
pear in the search results much longer than others. 

Third, we provide concrete evidence of the existence 
of large, connected, advertising “affiliate” networks, fun- 
neling traffic to over 90% of the illicit online pharmacies 
we encountered. Search-redirection attacks play a key 
role in diverting traffic to questionable retail operations 
at the expense of legitimate alternatives. 

Fourth, we analyze whether sites involved in the
pharmaceutical trade are involved in other forms of
suspicious retail activities, in other security attacks (e.g.,
serving malware-infested pages), or in spam email cam- 
paigns. While we find occasional evidence of other ne- 
farious activities, many of the pharmacies we inspect ap- 
pear to have moved away from email spam-based adver- 
tising. We discuss infection characteristics, affiliate net- 
works, and relationship with other attacks in Section 4. 

Fifth, we derive a rough estimate of the conversion 
rates achieved by search-redirection attacks, and show 
they are considerably higher than those observed for 
spam campaigns. We present this analysis in Section 5. 

Sixth, we consider a range of mitigation strategies that 
could reduce the harm caused by search-redirection at- 
tacks in Section 6. 

In addition to these contributions, we compare our 
study with related work in Section 7, before concluding 
in Section 8, where we also describe ongoing work track- 
ing the promotion of other types of fraudulent goods. 


2 Measurement methodology 


We now explain the methodology used to identify search- 
redirection attacks that promote online pharmacies. We 
first describe the infrastructure for data collection, then 
how search queries are selected, and finally how the 
search results are classified. 


2.1 Infrastructure overview 


The measurement infrastructure comprises two distinct 
components: a search-engine agent that sends drug- 
related queries and a crawler that checks for behavior 
associated with search-redirection attacks.²

The search-engine agent uses the Google Web Search 
API [2] to automatically retrieve the top 64 search re- 
sults to selected queries. From manually inspecting some 
compromised websites, we found that search-redirection 
attacks frequently also work on other search engines. 
Every 24 hours, the search-engine agent automatically 
sends 218 different queries for prescription drug-related 
terms (e.g., “cialis without prescription”) and stores all
13952 (= 64 × 218) URIs returned. We explain how we
selected the corpus of 218 queries in Section 2.2. 

The crawler module then contacts each URI collected 
by the search-engine agent and checks for HTTP 302 
redirects mentioned in Section 1.1. The crawler emulates 
typical web-search activity by setting the User-Agent and 
Referrer terms appropriately in the HTTP headers. Ini- 
tial tests revealed that some source infections had been 
programmed to block repeated requests from a single IP 
address. Consequently, all crawler requests are tunneled 
through the Tor network [11] to circumvent the blocking. 
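
The crawler's core check can be sketched as two small helpers: one builds headers that make a request look like a click on a search result, and one decides whether a response is a redirect to a different host. This is a minimal illustration rather than the authors' crawler. The function names and the browser string are hypothetical, and the network and Tor tunneling layers are deliberately left out. Note that the text spells the field "Referrer", while the header name used on the wire is `Referer`.

```python
from urllib.parse import urlparse, quote_plus


def search_click_headers(query: str) -> dict:
    """Headers emulating a user who clicked a result for `query`."""
    return {
        # Placeholder value; any ordinary browser string would do.
        "User-Agent": "Mozilla/5.0 (hypothetical browser string)",
        "Referer": "http://www.google.com/search?q=" + quote_plus(query),
    }


def is_external_redirect(request_uri: str, status: int, location: str) -> bool:
    """True if the response is an HTTP 302 pointing at a different host."""
    if status != 302 or not location:
        return False
    source_host = urlparse(request_uri).hostname
    target_host = urlparse(location).hostname
    # A relative Location has no hostname and stays on the same site.
    return target_host is not None and target_host != source_host
```

Comparing hostnames rather than full URIs keeps ordinary same-site redirects, which legitimate websites use regularly, out of the result set.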


² All results gathered by the crawler are stored in a MySQL database,
available from http://arima.ini.cmu.edu/rx.sql.gz.

2.2 Query selection


Selecting appropriate queries to feed the search-engine 
agent is critical for obtaining suitable quality, coverage 
and representativeness in the results. We began by issu- 
ing a single seed query, “no prescription vicodin,” cho- 
sen for the many source infections it returned at the time 
(March 3, 2010). We then browsed the top infected re- 
sults posing as a search engine crawler. As described in 
Section 1.1, infected servers present different results to 
search-engine crawlers. The pages include a mixture of 
the site’s original content and a number of drug-related 
search phrases designed to make the website attractive 
to search engines for these queries. The inserted phrases 
typically linked to other websites the attacker wishes to 
promote, in our case other online pharmacies. 

We compiled a list of promoted search phrases by vis- 
iting the linked pharmacies posing as a search-engine 
crawler and noting the phrases observed. Many phrases 
were either identical or contained only minor differences, 
such as spelling variations on drug names. We reduced 
the list to a corpus of 48 unique queries, representative 
of all drugs advertised in this first step. 

We then repeated this process for all 48 search phrases, 
gathering results daily from March 3, 2010 through April 
11, 2010. The 48-query search subsequently led us to 
371 source infections. We again browsed each of these 
source infections posing as a search engine crawler, and 
gathered a few thousand search phrases linked from the 
infected websites. After again sorting through the dupli- 
cates, we got a corpus of 218 unique search queries. 
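
The deduplication step can be approximated as follows. The text does not say how duplicates and spelling variants were merged, so this sketch assumes a simple normalization (lower-casing, collapsing whitespace) plus a hand-maintained map of spelling variants. The `variants` map and both function names are hypothetical.

```python
def normalize(phrase: str, variants: dict) -> str:
    """Lower-case, collapse whitespace, and map known spelling variants."""
    words = phrase.lower().split()
    return " ".join(variants.get(word, word) for word in words)


def unique_queries(phrases, variants):
    """Return normalized phrases with duplicates removed, in first-seen order."""
    return list(dict.fromkeys(normalize(p, variants) for p in phrases))


# Usage with a hypothetical variant map:
harvested = [
    "No Prescription Vicodin",
    "no prescription   vicodin",
    "no prescription vicodine",
    "cialis without prescription",
]
corpus = unique_queries(harvested, {"vicodine": "vicodin"})
```

Here `corpus` holds two queries, since the first three harvested phrases collapse into one.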

The risk of starting from a single seed is to only iden- 
tify a single unrepresentative campaign. Hence, we ran a 
validation experiment to ensure that our selected queries 
had satisfactory coverage. We obtained a six-month sam- 
ple of spam email (collected at a different time period, 
late 2009) gathered in a different context [42]. We ran 
SpamAssassin [5] on this spam corpus, to classify each 
spam as either pharmacy-related or otherwise. We then 
extracted all drug names encountered in the pharmacy- 
related spam, and observed that they defined a subset of 
the drug names present in our search queries. This gave 
us confidence that the query corpus was quite complete. 

We further validated our query selection by comparing 
results obtained with our query corpus to those collected 
from two additional query corpora: 1) searches run on
an exhaustive list of 9000 prescription drugs obtained 
from the US Food & Drug Administration [15], and 
2) 1179 drug-related search queries extracted from the 
HTTP logs of 169 source websites. The results (in Ap- 
pendix A) confirm adequate coverage of our 218 queries. 


2.3. Search-result classification 


We attempt to classify all results obtained by the
search-engine agent. Each query returns a mix of legitimate
results (e.g., health information websites) and abusive
results (e.g., spammed blog comments and forum postings
advertising online pharmacies). We seek to distinguish
between these different types of activity to better
understand the impact search-redirection attacks may have
on legitimate pharmacies and other forms of abuse. We
assign each result to one of the following categories:
1) search-redirection attacks, 2) health resources,
3) legitimate online pharmacies, 4) illicit online pharmacies,
5) blog or forum spam, and 6) uncategorized.

We mark websites as participating in search- 
redirection attacks by observing an HTTP redirect to 
a different website. Legitimate websites regularly use 
HTTP redirects, but it is less common to redirect to en- 
tirely different websites immediately upon arrival from 
a search engine. Every time the crawler encounters a 
redirect, it recursively follows and stores the intermedi- 
ate URIs and IP addresses encountered in the database. 
These redirection chains are used to infer relationships 
between source infections and pharmacies in Section 4.3. 
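
Following a redirection chain can be sketched as a loop over an injected `fetch` function, which keeps the example independent of any HTTP library. This is an illustrative reconstruction, not the authors' crawler: `fetch`, `max_hops`, the loop guard, and the set of status codes treated as redirects are assumptions (the text itself mentions only status code 302).

```python
from urllib.parse import urljoin

REDIRECT_CODES = (301, 302, 303, 307)  # assumption; the text names only 302


def redirection_chain(start_uri, fetch, max_hops=10):
    """Return the list of URIs visited, starting with `start_uri`.

    `fetch(uri)` must return a (status_code, location_or_None) pair.
    """
    chain = [start_uri]
    current = start_uri
    for _ in range(max_hops):
        status, location = fetch(current)
        if status not in REDIRECT_CODES or not location:
            break
        current = urljoin(current, location)  # resolve relative locations
        if current in chain:                  # guard against redirect loops
            break
        chain.append(current)
    return chain


# Usage with a stub standing in for real HTTP requests:
responses = {
    "http://infected.example/": (302, "http://redirector.example/r"),
    "http://redirector.example/r": (302, "http://pharmacy.example/"),
}
chain = redirection_chain(
    "http://infected.example/", lambda uri: responses.get(uri, (200, None))
)
```

The first element of `chain` corresponds to a source infection, the middle elements to redirectors, and the last to the destination pharmacy.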

We performed two robustness checks to assess the 
suitability of classifying all external redirects as attacks. 
First, we found known drug terms in at least one redirect 
URI for 63% of source websites. Second, we found that 
86% of redirecting websites point to the same website 
as 10 other redirecting websites. Finally, 93% of redi- 
recting websites exhibit at least one of these behaviors, 
suggesting that the vast majority of redirecting websites 
are infected. In fact, we expect that most of the remain- 
ing 7% are also infected, but some attackers use unique 
websites for redirection. Thus, treating all external redi- 
rects as malicious appears reasonable in this study. 

Health resources are websites such as webmd.com 
that describe characteristics of a drug. We used the Alexa 
Web Information Service API [1], which is based on the 
Open Directory [4], to determine each website’s category.

We distinguish between legitimate and illicit online
pharmacies by using a list of registered pharmacies
obtained from the non-profit organization LegitScript [3].
LegitScript maintains a whitelist of 324 confirmed
legitimate online pharmacies, which require a verified
doctor’s prescription and sell genuine drugs. Illicit
pharmacies are websites that do not appear in LegitScript’s
whitelist, and whose domain name contains drug names
or words such as “pill,” “tabs,” or “prescription.”
LegitScript’s list is likely incomplete, so we may incorrectly
categorize some legitimate pharmacies as illicit
because they have not been certified by LegitScript.
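
The pharmacy rule just described reduces to a whitelist lookup followed by a substring test on the domain name. The sketch below assumes the whitelist is available as a set of domain names and that drug names are supplied as a list. The function name and the sample domains are hypothetical, and the real LegitScript list is not reproduced here.

```python
DOMAIN_WORDS = ("pill", "tabs", "prescription")  # words named in the text


def pharmacy_class(domain, whitelist, drug_names):
    """Return 'legitimate', 'illicit', or None if the rule does not apply."""
    name = domain.lower()
    if name in whitelist:
        return "legitimate"
    if any(term in name for term in DOMAIN_WORDS + tuple(drug_names)):
        return "illicit"
    return None  # left for the other categories (health, spam, ...)


# Usage with made-up domains:
whitelist = {"certified-pharmacy.example"}
drugs = ["cialis", "vicodin"]
labels = [
    pharmacy_class(d, whitelist, drugs)
    for d in ("certified-pharmacy.example", "cheap-pills.example", "news.example")
]
```

Checking the whitelist first matters: a certified pharmacy whose domain happens to contain a drug-related word must not be labeled illicit.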

Finally, blog and forum spam captures the frequent oc- 
currence where websites that allow user-generated con- 
tent are abused by users posting drug advertisements. We 
classify these websites based only on the URI structure, 
since collecting and storing the pages referenced by URIs 
is cost-prohibitive. We first check the URI subdomain
and path for common terms indicating user-contributed
content, such as “blog,” “viewmember” or “profile.” We
also check any remaining URIs for drug terms appearing
in the subdomain and path. While these might in fact be
compromised websites that have been loaded with content,
upon manual inspection the activity appears consistent
with user-generated content abuse.

                           URIs             Domains
                         #       %         #       %
Source infections    73909    53.8      4652    20.2
  Active             44503    32.4      2907    12.6
  Inactive           29406    21.4      1745     7.6
Health resources      1817     1.3       422     1.8
Pharmacies            4348     3.2      2138     9.3
  Legitimate            12    0.01         9    0.04
  Illicit             4336     3.2      2129     9.2
Blog/forum spam      41335    30.1      8064    34.9
Uncategorized        15945    11.6      7766    33.7
Total               137354   100.0     23042   100.0

Table 1: Classification of all search results (4–10/2010).
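
The URI-structure check used for blog and forum spam can be sketched as below. It is an approximation under stated assumptions: the subdomain is taken to be every host label left of the last two (which mishandles suffixes such as .co.uk), matching is by plain substring, and the function name is hypothetical. Only the three indicator terms come from the text.

```python
from urllib.parse import urlparse

UGC_TERMS = ("blog", "viewmember", "profile")  # terms named in the text


def looks_like_ugc_spam(uri, drug_terms=()):
    """True if the subdomain or path suggests user-generated drug content."""
    parsed = urlparse(uri.lower())
    host = parsed.hostname or ""
    # Crude subdomain extraction: drop the last two labels of the host.
    subdomain = ".".join(host.split(".")[:-2])
    haystack = subdomain + parsed.path
    return any(term in haystack for term in UGC_TERMS + tuple(drug_terms))
```

Because only the URI is inspected, no page content has to be fetched or stored, which is the cost constraint the text mentions.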


3 Empirical analysis of search results 


We begin our measurement analysis by examining the 
search results collected by the crawler. The objective 
here is to understand how prevalent search-redirection 
attacks are, in both absolute terms and relative to legit- 
imate sources and other forms of abuse. 


3.1 Breakdown of search results 


Table 1 presents a breakdown of all search results
obtained during the six months of primary data collection.
137 354 distinct URIs correspond to 23 042 different do- 
mains. We observed 44503 of these URIs to be com- 
promised websites (source infections) actively redirect- 
ing to pharmacies, 32% of the total. These corresponded 
to 4652 unique infected source domains. We examine 
the redirection chains in more detail in Section 4.3. 

An additional 29 406 URIs did not exhibit redirection 
even though they shared domains with URIs where we 
did observe redirection. There are several plausible ex- 
planations for why only some URIs on a domain will 
redirect to pharmacies. First, websites may continue to 
appear in the search results even after they have been re- 
mediated and stop redirecting to pharmacies. In Figure 1, 
the third link to appear in the search engine results has 
been disinfected, but the search engine is not yet aware 
of that. For 17% of the domains with inactive redirection 
links, the inactive links only appear in the search results 
after all the active redirects have stopped appearing. 

However, for the remaining 83% of domains, the
inactive links are interspersed among the URIs which
actively redirect. In this case, we expect that the
miscreants’ search engine optimization has failed, incorrectly
promoting pages on the infected website that do not
redirect to pharmacies.

Figure 2: Empirical measurements of pharmacy-related search results.
(a) Distribution of different classes of results according to the
position in the search results. (b) Change in the average domains
observed each day for different classes of search results over time.
(c) Search-redirection attacks appear in many queries; health resources
and blog spam appear less often in popular queries.

By comparison, very few search results led to legiti- 
mate resources. 1817 URIs, 1.3% of the total, pointed 
to websites offering health resources. Even more strik- 
ing, only nine legitimate pharmacy websites, or 0.04% 
of the total, appeared in the search results. By contrast, 
2 129 illicit pharmacies appeared directly in the search 
results. 30% of the results pointed to legitimate web- 
sites where miscreants had posted spam advertisements 
to online pharmacies. In contrast to the infected web- 
sites, these results require a user to click on the link to 
arrive at the pharmacy. It is also likely that many of these 
results were not intended for end users to visit; instead, 
they could be used to promote infected websites higher 
in the search results. 


3.2 Variation in search position 


Merely appearing in search results is not enough 
to ensure success for miscreants perpetrating search- 
redirection attacks. Appearing towards the top of the 
search results is also essential [20]. To that end, we col- 
lected data for an additional 10 weeks from November 
15th 2010 to February 1st 2011 where we recorded the 
position of each URI in the search results. 

Figure 2(a) presents the findings. Around one third of 
the time, search-redirection attacks appeared in the first 
position of the search results. 17% of the results were 
actively redirecting at the time they were observed in the 
first position. Blog and forum spam appeared in the top 
spot in 30% of results, while illicit pharmacies accounted 
for 22% and legitimate health resources just 5%. 

The distribution of results remains fairly consistent 
across all 64 positions. Active search-redirection attacks 
increase their proportion slightly as the rankings fall,
rising to 26% in positions 6-10. The share of illicit pharma-
cies falls considerably after the first position, from 22% 
to 14% for positions 2-10. Overall, it is striking how 
consistently all types of manipulation have crowded out 
legitimate health resources across all search positions. 


3.3. Turnover in search results 


Web search results can be very dynamic, even with- 
out an adversary trying to manipulate the outcome. We 
count the number of unique domains we observe in 
each day’s sample for the categories outlined in Sec- 
tion 2. Figure 2(b) shows the average daily count for two- 
week periods from May 2010 to February 2011, cov- 
ering both sample periods. The number of illicit phar- 
macies and health resources remains fairly constant over 
time, whereas the number of blogs and forums with phar- 
maceutical postings fell by almost half between May 
and February. Notably, the number of source infections 
steadily increased from 580 per day in early May to 895 
by late January, a 50% increase in daily activity. 


3.4 Variation in search queries 


As part of its AdWords program, Google offers a free 
service called Traffic Estimator to check the estimated 
number of global monthly searches for any phrase.³ We
fetched the results for the 218 pharmacy search terms 
we regularly check; in total, over 2.4 million searches 
each month are made using these terms. This gives us 
a good first approximation of the relative popularity of 
web searches for finding drugs through online pharma- 
cies. Some terms are searched for very frequently (as 
much as 246 000 times per month), while other terms are 
only searched for very occasionally. 

We now explore whether the quality of search results
varies according to the query’s popularity. We might
expect that less-popular search terms are easier to
manipulate, but also that there could be more competition to
manipulate the results of popular queries.

³ https://adwords.google.com/select/TrafficEstimatorSandbox

Figure 2(c) plots the average number of unique URIs 
observed per query for each category. For unpopular 
searches, with less than 100 global monthly searches, 
search-redirection attacks and blog spam appear with 
similar frequency. However, as the popularity of the 
search term increases, search-redirection attacks con- 
tinue to appear in the search results with roughly the 
same regularity, while the blog and forum spam drops 
considerably (from 355 URIs per query to 105). 

While occurring on a smaller scale, the trends of illicit 
pharmacies and legitimate health resources are also note- 
worthy. Health resources become increasingly crowded 
out by illicit websites as queries become more popular. 
For unpopular queries (< 100 global monthly searches), 
13 health URIs appear. But for queries with more than 
100000 results, the number of results falls by more than 
half to 6. For illicit pharmacies, the trends are opposite. 
On less popular terms, the pharmacies appear less of- 
ten (24 times on average). For the most popular terms, 
by contrast, 54 URIs point directly to illicit pharmacies. 
Taken together, these results suggest that the more so- 
phisticated miscreants do a good job of targeting their 
websites to high-impact results. 


4 Empirical analysis of search-redirection attacks

We now focus our attention on the structure and dynam- 
ics of search-redirection attacks themselves. We present 
evidence that certain types of websites are disproportion- 
ately targeted for compromise, that a few such websites 
appear most prominently in the search results, and that 
the chains of redirections from source infections to phar- 
macies betray a few clusters of concentrated criminality. 


4.1 Concentration in search-redirection attack sources


We identified 7 298 source websites from both data sets 
that had been infected to take part in search-redirection 
attacks — 4652 websites in the primary 6-month data set 
and 3 686 in the 10-week follow-up study. (1 130 sites 
are present in both datasets.) We now define a measure 
of the relative impact of these infected websites in order 
to better understand how they are used by attackers. 


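\[
I(\text{domain}) = \sum_{q \in \text{queries}} \; \sum_{d \in \text{days}} u_{qd} \cdot 0.5^{\frac{r_{qd} - 1}{10}}
\]

where

\[
u_{qd} =
\begin{cases}
1 & \text{if the domain is in the results of query } q \text{ on day } d \text{ and actively redirects to a pharmacy} \\
0 & \text{otherwise}
\end{cases}
\]

\[
r_{qd} = \text{the domain's position } (1..64) \text{ in the search results}
\]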
Figure 3: Rank-order CDF of domain impact reveals
high concentration in search-redirection attacks.





                        .com   .org   .edu   .net   other
% global Internet        45%     4%    <3%     6%     42%
% infected sources       55%    16%     6%     6%     17%
% inf. source impact     30%    24%    35%     2%     10%


Table 2: TLD breakdown of source infections.


The goal of the impact measure I is to distill the many
observations of an infected domain into a comparable 
scalar value. Essentially, we add up the number of times 
a domain appears, while compensating for the relative 
ranking of the search results. Intuitively, when a domain 
appears as the top result it is much more likely to be uti- 
lized than if it appeared on page four of the results. The 
heuristic we use normalizes the top result to 1, and dis- 
counts the weighting by half as the position drops by 10. 
This corresponds to regarding results appearing on page 
one as twice as valuable as those on page two, which are 
twice as valuable as those on page three, and so on. 
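
As a worked illustration of the heuristic just described, the sketch below sums a weight of 0.5 raised to (position - 1)/10 over every observation in which a domain actively redirects. The input format, a list of (position, actively_redirects) pairs with one entry per query and day, and the function name are assumptions made for the example.

```python
def impact(observations):
    """Impact of one domain.

    `observations` holds one (position, actively_redirects) pair for each
    (query, day) on which the domain appeared; position is 1..64.
    """
    return sum(
        0.5 ** ((position - 1) / 10)
        for position, actively_redirects in observations
        if actively_redirects
    )


# A top result counts 1, position 11 counts 0.5, position 21 counts 0.25;
# appearances that do not redirect contribute nothing.
example = impact([(1, True), (11, True), (21, True), (5, False)])
```

For these four observations the total is 1 + 0.5 + 0.25 = 1.75, matching the page-one, page-two, page-three halving described in the text.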


Some infected domains appeared in the search results 
much more frequently and in more prominent positions 
than others. The domain with the greatest impact,
unm.edu, accounted for 2% of the total impact of all
infected domains. Figure 3 plots the ordered distribution
of the impact measure I for source domains, using a
logarithmic x-axis. The top 1% of source domains account
for 32% of all impact, while the top 10% account for
81% of impact. This indicates that a small, concentrated
number of infected websites accounts for most of the most
visible redirections to online pharmacies.

We also examined how the prevalence and impact of 
source infections varied according to top-level domain 
(TLD). The top row in Table 2 shows the relative preva- 
lence of different TLDs on the Internet [46]. The sec- 
ond row shows the occurrence of infections by TLD. 
The most affected TLD, with 55% of infected results,
is .com, followed by .org (16%), .edu (6%) and
.net (6%). These four TLDs account for 83% of all
infections, with the remaining 17% spread across 159
TLDs. We also observed 25 infected .gov websites and
22 governmental websites from other countries.

One striking conclusion from comparing these figures
is that more ‘reputable’ TLDs, such as .com (55%
of infections vs. 45% of registrations), .org (16% vs.
4%) and .edu (6% vs. < 3%), are infected
disproportionately often.
This is in contrast to other research, which has identified 
country-specific TLDs as sources of greater risk [26]. 

Furthermore, some TLDs are used more frequently in 
search-redirection attacks than others. While .edu do- 
mains constitute only 6% of source infections, they ac- 
count for 35% of aggregate impact through redirections 
to pharmacy websites. Domains in .com, by contrast, 
account for more than half of all source domains but 30% 
of all impact. We next explore how infection durations 
vary across domains, in part with respect to TLD. 


4.2 Variation in source infection lifetimes 


One natural question when measuring the dynamics of 
attack and defense is how long infections persist. We de- 
fine the “lifetime” of a source infection as the number of 
days between the first and last appearance of the domain 
in the search results while the domain is actively redi- 
recting to pharmacies. Lifetime is a standard metric in 
the empirical security literature, even if the precise def- 
initions vary by the attacks under study. For example, 
Moore and Clayton [27] observed that phishing websites 
have a median lifetime of 20 hours, while Nazario and 
Holz [32] found that domains used in fast-flux botnets 
have a mean lifetime of 18.5 days. 

Calculating the lifetime of infected websites is not en- 
tirely straightforward, however. First, because we are 
tracking only the results of 218 search terms, we count 
as “death” whenever an infected website disappears from 
the results or stops redirecting, even if it remains in- 
fected. This is because we consider the harm to be mini- 
mized if the search engine detects manipulation and sup- 
presses the infected results algorithmically. However, to 
the extent that our search sample is incomplete, we may 
be overly conservative in claiming a website is no longer 
infected when it has only disappeared from our results. 

The second subtlety in measuring lifetimes is that
many websites remain infected at the end of our study,
making it impossible to observe when these infections are
remediated. Fortunately, this is a standard problem in 
statistics and can be solved using survival analysis. Web- 
sites that remain infected and in the search results at the 
end of our study are said to be right-censored. 1368 of 
the 4652 infected domains (29%) are right-censored. 

The survival function S(t) measures the probability
that the infection’s lifetime is greater than time t. The
survival function is similar to a complementary cumu- 
lative distribution function, except that the probabilities 
must be estimated by taking censored data points into
account. We use the standard Kaplan-Meier estimator [23]
to calculate the survival function for infection lifetimes,
as indicated by the solid black line in the graphs of Fig- 
ure 4. The median lifetime of infected websites is 47 
days; this can be seen in the graph by observing where 
S(t) = 0.5. Also noteworthy is that at the maximum 
time t = 192, S(t) = 0.160. Empirical survival estima- 
tors such as Kaplan-Meier do not extrapolate the survival 
distribution beyond the longest observed lifetime, which 
is 192 days in our sample. What we can discern from the 
data, nonetheless, is that 16% of infected domains were 
in the search results throughout the sample period, from 
April to October. Thus, we know that a significant mi- 
nority of websites have remained infected for at least six 
months. Given how hard it is for webmasters to detect 
compromise, we expect that many of these long-lived in- 
fections have actually persisted far longer. 
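
To make the role of right-censoring concrete, the following is a minimal sketch of the Kaplan-Meier product-limit estimator in plain Python. The input format and the function names are assumptions made for illustration, and the toy lifetimes are invented rather than taken from our data.

```python
def kaplan_meier(lifetimes):
    """Kaplan-Meier estimate of the survival function.

    `lifetimes` is a list of (duration_days, observed) pairs, where
    observed=False marks a right-censored infection (still alive when
    observation ended). Returns [(t, S(t))] at each time with a death.
    """
    n_at_risk = len(lifetimes)
    survival = 1.0
    curve = []
    for t in sorted({duration for duration, _ in lifetimes}):
        deaths = sum(1 for d, obs in lifetimes if d == t and obs)
        leaving = sum(1 for d, _ in lifetimes if d == t)
        if deaths:
            # Censored sites still count as "at risk" up to time t.
            survival *= 1 - deaths / n_at_risk
            curve.append((t, survival))
        n_at_risk -= leaving
    return curve


def median_lifetime(curve):
    """First time at which the estimated S(t) drops to 0.5 or below."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None  # the curve never reaches 0.5 within the sample


# Invented example: three observed remediations, two censored infections.
sample = [(5, True), (10, True), (10, False), (20, True), (30, False)]
curve = kaplan_meier(sample)
median = median_lifetime(curve)
```

Censored infections never trigger a drop in S(t), but they do enlarge the at-risk denominator until they leave the sample. This is why the estimate differs from a plain empirical distribution of observed lifetimes.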

We next examine the characteristics of infected web- 
sites that could lead to longer or shorter lifetimes. One 
possible source of variation to consider is the TLD. Fig- 
ure 4 (left) also includes survival function estimates for 
each of the four major TLDs, plus all others. Survival 
functions to the right of the primary black survival graph 
(e.g., .edu) have consistently longer lifetimes, while 
plots to the left (e.g., other and . net) have consistently 
shorter lifetimes. Infections on .com and .org appear 
slightly longer than average, but fall within the 95% con- 
fidence interval of the overall survival function. 

The median infection duration of .edu websites is 
113 days, with 33% of .edu domains remaining in- 
fected throughout the 192-day sample period. By con- 
trast, the less popular TLDs taken together have a median 
lifetime of just 28 days. 

Another factor beyond TLD is also likely at play: the 
relative reputation of domains. Web domains with higher 
PageRank are naturally more likely to appear at the top 
of search results, and so are more likely to persist in the 
results. Indeed, we observe this in Figure 4 (center). In- 
fected websites with PageRank 7 or higher have a me- 
dian lifetime of 153 days, compared to just 17 days for 
infections on websites with PageRank 0. 

One might expect that .edu domains would tend to 
have higher PageRanks, and so it is natural to wonder 
whether these graphs indicate the same effect, or two dis- 
tinct effects. To disentangle the effects of different web- 
site characteristics on lifetime, we use a Cox proportional 
hazard model [10] of the form: 


\[
h(t) = \exp(\alpha + \text{PageRank}\, x_1 + \text{TLD})
\]


Note that the dependent variable included in the Cox 
model is the hazard function h(t). The hazard function 
h(t) expresses the instantaneous risk of death at time t. 
Cox proportional hazard models are used on survival data 
in preference to standard regression models, but the aim
is the same as for regression: to measure the effect of
different independent factors (in our case, TLD and
PageRank) on a dependent variable (in our case, infection
lifetime). PageRank is included as a numerical variable
valued from 0 to 9, while TLD is encoded as a five-part
categorical variable using deviation coding. (Deviation
coding is used to measure each category's deviation in
lifetime from the overall mean value, rather than
deviations across categories.) The results are presented in
the table in Figure 4. PageRank is significantly correlated
with lifetimes: lower PageRank is associated with shorter
lifetimes, while higher PageRank is associated with longer
lifetimes. Separately, .edu domains are correlated with
longer lifetimes and other TLDs with shorter lifetimes.

              coef.    exp(coef.)   Std. Err.   Significance
PageRank     -0.085    0.92         0.0098      p < 0.001
.edu         -0.26     0.77         0.086       p < 0.001
.net          0.08     1.1          0.084
.org          0.055    1.0          0.054
other TLDs    0.34     1.4          0.053       p < 0.001

log-rank test: Q = 158, p < 0.001

Figure 4: Survival analysis of search-redirection attacks shows that TLD and PageRank influence infection lifetimes.

Coefficients in Cox models cannot be interpreted quite
as easily as in standard linear regression; the exponentiated
coefficients (the exp(coef.) column in the table) offer the
clearest interpretation. exp(PageRank) = 0.92 indicates that
each one-point increase in the site's PageRank decreases the
hazard rate by 8%. A decrease in the hazard rate leads to
longer lifetimes. Meanwhile, exp(.edu) = 0.77 indicates that
the presence of a .edu domain, holding the PageRank constant,
decreases the hazard rate by 23%. In contrast, the presence
of any TLD besides .com, .edu, .net and .org
increases the hazard rate by 40%.
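
The percentages quoted above follow directly from exponentiating the reported coefficients. The short sketch below redoes that arithmetic; the coefficient values are copied from the table in Figure 4, while the helper name `hazard_change_percent` is ours.

```python
import math

# Coefficients as reported in the table in Figure 4.
coefficients = {
    "PageRank": -0.085,
    ".edu": -0.26,
    ".net": 0.08,
    ".org": 0.055,
    "other TLDs": 0.34,
}


def hazard_change_percent(coef: float) -> float:
    """Percent change in the hazard rate implied by a Cox coefficient."""
    return (math.exp(coef) - 1) * 100


changes = {name: hazard_change_percent(c) for name, c in coefficients.items()}
for name, change in changes.items():
    print(f"{name:>10}: {change:+.1f}% hazard")
```

A negative change means a lower instantaneous risk of "death" and therefore a longer expected infection lifetime.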

Therefore, we can conclude from the model that both
PageRank and TLD matter. Even lower-ranked university
websites and high-rank non-university websites are
being effectively targeted by attackers redirecting traffic
to pharmacy websites.


4.3 Characterizing the online pharmacy 
network 


We now extend consideration beyond the websites di- 
rectly appearing in search results to the intermediate and 
destination websites where traffic is driven in search- 
redirection attacks. We use the data to identify connec- 
tions between a priori unrelated online pharmacies. 

We construct a directed graph G = (V, E) as follows.
We gather all URIs in our database that are
part of a redirection chain (source infection, redirector,
online pharmacy) and assign each second-level domain
to a node v ∈ V. We then create edges between
nodes whenever domains redirect to each other. Suppose
for instance that http://www.example.com/blog
is infected and redirects to http://1337.attacker.test,
which in turn redirects to http://www32.cheaprx4u.test.
We then create three nodes v1 = example.com,
v2 = attacker.test and v3 = cheaprx4u.test, and two
edges, v1 → v2 and v2 → v3. Now, if
http://hax0r.attacker.test is also present in the
database, and redirects to http://www.otherrx.test,
we create a node v4 = otherrx.test and establish
an edge v2 → v4.

In the graph G so built, online pharmacies are usually
leaf nodes with a positive in-degree and out-degree zero.⁴
Compromised websites feeding traffic to pharmacies are
generally represented as sources, with an in-degree of
zero and a positive out-degree. Traffic redirectors, which
act as intermediaries between compromised websites and
online pharmacies, have positive in- and out-degrees.
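
The construction and the degree-based roles can be sketched as follows, reusing the example chains from the text. This is an illustrative sketch, not the code used in the study: the function names are hypothetical, and reducing a host name to its last two labels is a simplifying assumption that would mishandle suffixes such as .co.uk.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def second_level_domain(url: str) -> str:
    """Naive reduction of a URL to its second-level domain (assumption:
    the registrable domain is always the last two labels)."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def build_edges(chains):
    """Turn redirection chains (lists of URLs) into a set of directed edges."""
    edges = set()
    for chain in chains:
        nodes = [second_level_domain(url) for url in chain]
        for src, dst in zip(nodes, nodes[1:]):
            if src != dst:  # ignore redirects within the same domain
                edges.add((src, dst))
    return edges


def classify(edges):
    """Label each node by its in- and out-degree."""
    indeg, outdeg = defaultdict(int), defaultdict(int)
    for src, dst in edges:
        outdeg[src] += 1
        indeg[dst] += 1
    roles = {}
    for node in set(indeg) | set(outdeg):
        if indeg[node] == 0:
            roles[node] = "source"
        elif outdeg[node] == 0:
            roles[node] = "pharmacy"
        else:
            roles[node] = "redirector"
    return roles


chains = [
    ["http://www.example.com/blog", "http://1337.attacker.test",
     "http://www32.cheaprx4u.test"],
    ["http://hax0r.attacker.test", "http://www.otherrx.test"],
]
edges = build_edges(chains)
roles = classify(edges)
```

Both attacker.test subdomains collapse onto a single node here, which is what links the two otherwise unrelated pharmacies.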

The resulting graph G for our entire database consists
of 34 connected subgraphs containing more than
two nodes. The largest connected component G0
contains 96% of all infected domains, 90% of the redirection
domains and 92% of the pharmacy domains collected
throughout the six-month collection period.

In other words, we have evidence that most online 
pharmacies are connected by redirection chains. While 
this does not necessarily indicate that a single criminal 
organization is behind the entire online pharmacy net- 
work, this does tell us that most illicit online pharmacies 
in our measurements are obtaining traffic from a large 
interconnected network of advertising affiliates. Under- 
cover investigations have confirmed the existence of such 
affiliate networks and provided anecdotal evidence on
their operations [44], but they have not precisely quantified
their influence. These affiliate networks consist of a
loosely organized set of independent advertising entities
that feed traffic to their customers (e.g., online retailers)
in exchange for a commission on any resulting sales.

⁴ Manually checking the data, we find a few pharmacies have an
out-degree of 1, and redirect to other pharmacies.

Figure 5: Network analysis of redirection chains reveals community structure in search-redirection attacks.
(a) Structure of the giant component G0 that links
96% of infected domains. Links between vertices are
based on observed traffic redirection chains. Vertices
are colored according to their community.
(b) CDF of nodes in the giant component belonging to
different communities. The largest 7 (out of 73)
communities comprise over half the nodes.
(c) Scatter plot of in- and out-degree of
nodes in the giant component. (Log-log
scale, where 0 is technically represented
as 0.1.)


Communities and affiliated campaigns. To uncover
affiliate networks, we locate communities within G0, i.e.,
sets of vertices closely interconnected with each other
and only loosely connected to the rest of the graph. Here,
each community represents a set of domains in close
relationship with each other, possibly part of the same
business operation, or in the same manipulation campaigns.
Several algorithms have recently been proposed for
community detection, e.g., [36,41,43]. We use the spin-glass
model proposed by Reichardt and Bornholdt [43] (with
q = 500, γ = 1) because its stochastic nature allows it
to complete quickly even on large graphs like ours, and
because it works on directed graphs.


In Figure 5(a), we plot a visual representation of G0.
Different colors denote different communities. The
community detection algorithm identifies a total of 73 distinct
communities. Most larger communities can be observed
in the dense clusters of nodes in the center of the figure,
and it appears that fewer than a dozen communities
play a significant role. More precisely, we plot in
Figure 5(b) the cumulative fraction of nodes in G0 as a
function of the number of communities considered. The
graph shows that the seven largest communities account
for more than half of the nodes in the graph, and that
about two thirds of the nodes belong to one of the top
twelve communities. In other words, a relatively small
number of loosely interconnected, possibly distinct,
operations is responsible for most attacks.

Manual inspection confirms these insights. For instance,
the third largest community (400 nodes) consists
of compromised hosts primarily sending traffic to a single
redirector, which itself redirects to a single pharmacy
(securetabs.net).

Figure 5(c) is a scatter-plot of the in- and out-degree
of each node in G0. A vast majority of nodes are source
infections (null in-degree, high out-degree, i.e., points
along the y-axis) or pharmacies (low out-degree, high
in-degree, i.e., along the x-axis). Redirectors, with non-zero
in- and out-degrees, are comparatively rare. We identify
314 redirectors in G0, out of which only 127 have both
an in- and an out-degree greater than two. 103 of these
127 redirectors (80%) are cut vertices for G0. That is,
removing any of these 103 redirectors would partition G0.
We will discuss these interesting properties in further
detail in Section 6, where we detail the possible remedial
strategies against the search-redirection attacks.
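
For readers unfamiliar with cut vertices, the sketch below finds them by brute force on a small invented graph: a node is a cut vertex if deleting it disconnects the remaining nodes. Edge direction is ignored, the input is assumed to be connected to begin with, and the node names are made up. A graph of the size of G0 would call for a linear-time algorithm such as Tarjan's instead.

```python
def is_connected(nodes, edges):
    """True if `nodes` form one connected piece, ignoring edge direction."""
    nodes = set(nodes)
    if not nodes:
        return True
    adjacency = {n: set() for n in nodes}
    for a, b in edges:
        if a in nodes and b in nodes:
            adjacency[a].add(b)
            adjacency[b].add(a)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        for neighbor in adjacency[stack.pop()]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen == nodes


def cut_vertices(nodes, edges):
    """Nodes whose removal partitions an initially connected graph."""
    nodes = set(nodes)
    return {v for v in nodes if not is_connected(nodes - {v}, edges)}


# Invented toy network: two sources, two redirectors, two pharmacies.
toy_edges = [
    ("s1", "r1"), ("s2", "r1"), ("r1", "p1"), ("r1", "p2"),
    ("s2", "r2"), ("r2", "p2"),
]
toy_nodes = {n for edge in toy_edges for n in edge}
cuts = cut_vertices(toy_nodes, toy_edges)
```

In this toy network only r1 is a cut vertex: taking it down strands s1 and p1, whereas r2 can be removed without partitioning the graph.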


4.4 Attack websites in blacklists 


The websites we have identified here have either been 
compromised (in the case of source infections) or have 
taken advantage of compromised servers (in the case 
of redirects and pharmacies). Given such insalubrious 
circumstances, we wondered if any of the third party 
blacklists dedicated to identifying Internet wickedness 
might also have noticed these same websites. To that 
end, we consulted three different sources: Google’s Safe 
Browsing API, which identifies web-based malware; the 
zen.spamhaus.org blacklist, which identifies email 
spam senders; and McAfee SiteAdvisor, which tests 
websites for “spyware, spam and scams”. 

Figure 6: Comparing web and email blacklists.

                   Mean   Median   % Searches > 0        Total
Main             14 388    1 600              73%    2 374 085
FDA drugs            74        0               6%      323 104
Extra queries    46 380    1 300              59%   32 652 121
Total             6 771        0              20%   35 343 610

Table 3: Monthly search query popularity according to
the Google Adwords Traffic Estimator.

Figure 6 plots sets of Venn diagrams of the three blacklists
for each class of attack domain. Several trends are
apparent from inspecting the diagrams. First, source
infections are not widely reported by any of the blacklists
(95% do not appear on a single blacklist), but around half 
of the redirects are found on at least one blacklist and 
over two thirds of pharmacy websites show up on at least 
one blacklist. Surprisingly, 12% of redirects appear on 
the email spam blacklist, as well as 24% of pharmacies. 
We speculate that this could be caused by affiliates adver- 
tising pharmacy domains in email spam, but it could also 
be that the pharmacies directly send email spam adver- 
tisements or use botnets for both hosting and spamming. 
The levels of coverage of Google and SiteAdvisor are
comparable, which is somewhat surprising given
SiteAdvisor’s relatively broader remit to flag scams, not only
malware. Google’s more comprehensive coverage of 
pharmacy websites in particular suggests that some phar- 
macies may also engage in distributing malware. We 
conclude by noting that the majority of websites affected 
by the traffic redirection scam are not identified by any of 
these blacklists. This in turn suggests that relatively lit- 
tle pressure is currently being applied to the miscreants 
carrying out the attacks. 


5 Towards a conversion rate estimate 


While it is difficult to measure precisely as an outsider, 
we nonetheless would like to provide a ballpark figure 
for how lucrative web search is to the illicit online pre- 
scription drug trade. Here we measure two aspects of the 
demand side: search-query popularity and sales traffic.

For the first category, we once again turn to the Google
Traffic Estimator to better understand how many people
use online pharmacies advertised through
search-redirection attacks. Table 3 lists the results for each of
the three search query corpora described in Section 2.2
and Appendix A. The main and extra queries attract the
most visitors, with a median of 1 600 monthly searches
for the main sample and 1 300 for the extra queries.
Several highly popular terms appeared in the results:
“viagra” and “pharmacy” each attract 6 million monthly
searches, while “cialis” and “phentermine” appear in
around 3 million each. By contrast, only 6% of the search
queries in the FDA sample registered with the Google
tool. The FDA query list includes around 6 500 terms,
which dwarfs the size of the other lists. Since over 90%
of the FDA queries are estimated to have no monthly
searches, the overall median popularity is also zero.

While these search terms do not cover all possible 
queries, taken together they do represent a useful lower 
bound on the global monthly searches for drugs. To 
translate the aggregate search count into visits to phar- 
macies facilitated by search-redirection attacks, we as- 
sume that the share of visits websites receive is pro- 
portional to the number of URIs that turn up in the 
search results. Given that 38% of the search results we
found pointed to infected websites, we might expect
the monthly number of visits to these sites facilitated by
Google searches to be around 13 million. Google
reportedly has a 64.4% market share in search [13].
Consequently, we expect the traffic arriving from other
search engines to be $\frac{1 - 0.644}{0.644} \times 13$ million $= 7$ million.
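
The arithmetic behind these estimates can be retraced as follows. This is a sketch that assumes the 38% share is applied to the total monthly search volume listed in Table 3 (35 343 610), which the text implies but does not state explicitly. The resulting sum of roughly 20 million is the figure used later for total visits to search-redirection websites.

\[
\begin{aligned}
\text{Google visits} &\approx 0.38 \times 35\,343\,610 \approx 13.4 \text{ million} \approx 13 \text{ million} \\
\text{Other engines} &\approx \frac{1 - 0.644}{0.644} \times 13 \text{ million} \approx 0.553 \times 13 \text{ million} \approx 7 \text{ million} \\
\text{Total} &\approx 13 \text{ million} + 7 \text{ million} = 20 \text{ million}
\end{aligned}
\]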

We manually visited 150 pharmacy websites identified 
in our study and added drugs to shopping carts to observe 
the beginning of the payment process. We found that 94 
of these websites in fact pointed to one of 21 different 
payment processing websites. These websites typically 
had valid SSL certificates signed by trusted authorities, 
which helps explain why multiple pharmacy storefronts 
may want to share the same payment processing website. 

The fact that these websites are only used for payment 
processing means that if we could measure the traffic to 
these websites, then we could roughly approximate how 
many people actually purchase drugs from these pharma- 
cies. Fortunately for us, these websites receive enough 
traffic to be monitored by services such as Alexa. We 
tallied Alexa’s estimated daily visits for each of these 
websites; in total, they receive 855 000 monthly visits. 

We next checked whether these payment websites also
processed payments for websites other than pharmacies.
To check this, we fetched 1 000 backlinks for
each of the sites from Yahoo Site Explorer [6].
Collectively, 1 561 domains linked in to the payment
websites. From URI naming and manual inspection, we
determined that at least 1 181 of the backlink domains, or
75%, are online pharmacies. This suggests that the primary
purpose of these websites is to process payments
for online pharmacies.


Taken together, we can use all the information dis- 
cussed above to provide a lower bound on the sales con- 
version rate of pharmacy web search traffic: 


\[
\text{Conversion} = \frac{0.75 \times 855\,000}{20\,000\,000} = 3.2\%
\]


To ensure that the estimate is a lower bound for the 
true conversion rate, whenever there is uncertainty over 
the correct figures, we select smaller estimates for fac- 
tors in the numerator and larger estimates for factors in 
the denominator. For example, it is possible that the esti- 
mate of visits to payment sites is too small, since pharma- 
cies could use more than the 21 websites we identified to 
process payments. A more accurate estimate here would 
strictly increase the conversion rate. Similarly, 20 mil- 
lion visits to search-redirection websites may be an over- 
estimate, if, for instance, more popular search queries 
suffer from fewer search-redirection attacks. Reducing 
this estimate would increase the conversion rate since the 
figure is in the denominator. 


There is likely one slight overestimate present in the 
numerator. It is not certain that every single visitor to a 
payment processing site eventually concluded the trans- 
action. However, because these sites are only used to 
process payments, we can legitimately assume that most 
visitors ended up purchasing products. Even with a
conservative assumption that only 1 in 10 visitors to the
payment processing site actually complete a transaction, the
lower bound on the conversion rate we would obtain (on
the order of 0.3%) far exceeds the conversion rates
observed for email spam [22] or social-network spam [17].
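
The lower bound and its more pessimistic variant can be recomputed with a few lines of Python. The inputs are the figures quoted in this section. The function name and the `completion_rate` parameter are illustrative choices of ours, with the latter modelling the "1 in 10 visitors" assumption.

```python
def conversion_lower_bound(payment_visits, pharmacy_fraction,
                           search_visits, completion_rate=1.0):
    """Estimated sales per visit driven by search-redirection attacks.

    payment_visits    -- monthly visits to the payment-processing sites
    pharmacy_fraction -- share of those visits attributed to pharmacies
    search_visits     -- monthly visits to search-redirection websites
    completion_rate   -- share of payment-site visitors who actually buy
    """
    return pharmacy_fraction * payment_visits * completion_rate / search_visits


# Figures from the text: 855 000 visits, 75% pharmacies, 20 million visits.
baseline = conversion_lower_bound(855_000, 0.75, 20_000_000)
pessimistic = conversion_lower_bound(855_000, 0.75, 20_000_000,
                                     completion_rate=0.1)

print(f"baseline:    {baseline:.2%}")
print(f"pessimistic: {pessimistic:.2%}")
```

The baseline evaluates to about 3.2%, and the one-in-ten variant to about 0.3%.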


While email spam has attracted more attention, our 
research suggests that more illicit pharmacy purchases 
are facilitated by search-redirection attacks than by email 
spam. One study estimated that the entire Storm botnet
(which accounted for between 20% and 30% of email
spam at its peak [12,37]) attracted around 2 100 sales
per month [22]. The payment processing websites tied 
to search-redirection attacks collectively process many 
hundreds of thousands of monthly sales. Even allowing 
for the possibility that these websites may also process 
payments for pharmacies advertised through email spam, 
the bulk of sales are likely dominated by referrals from 
web search. This is not surprising, given that most peo- 
ple find it more natural to turn to their search engine of 
choice than to their spam folder when shopping online. 
To those who aim to reduce unauthorized pharmaceutical 
sales, the implication is clear: more emphasis on
combating transactions facilitated by web search is warranted.


6 Mitigation strategies


The measurements we gathered lead us to consider three 
complementary mitigation strategies to reduce the im- 
pact of search-redirection attacks. One can target the in- 
fected sources, advocate search-engine intervention, or 
try to disrupt the affiliate networks. 


Remediation at the sources. The existing public-private 
partnership initiated by the White House [19] has so 
far focused on areas other than search-redirection at- 
tacks. Domain name registrars (led by GoDaddy) can 
shut down maliciously registered domains, while Google 
has focused on blocking advertisements (but not neces- 
sarily search results) from unauthorized pharmacies. Un- 
fortunately, no single entity speaks for the many web- 
masters whose sites have unknowingly been recruited to 
drive traffic to illicit pharmacies. 

Nonetheless, eradicating source infections at key web- 
sites could be effective. As shown in Figure 3, a small 
number of source infections repeatedly appear towards 
the top of the search results. Remediating only the most 
frequently-occurring websites could substantially reduce 
sales. Furthermore, attackers would likely struggle to 
adapt to the heightened enforcement. First, placing websites
at high-ranking search positions through search-engine
optimization is a slow process, given that the search engine
controls the rankings-update cycle. Second, high-ranking
websites that can permeate the top levels of
search results are fairly scarce resources, so that any
coordinated reduction is likely to be painful for pharmacies.

How might an enforcement agent select which web- 
sites to target for remediation? Again, our findings are 
informative. The survival analysis in Section 4.2 indi- 
cates that websites with high PageRank or .edu TLDs 
are more persistent. A simple heuristic, then, would be 
for an agent to run a few search queries for drug terms 
and try to clean up any .edu or high-ranking website 
that appears in multiple results. 
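
The heuristic can be written down as a small filter. The sketch below is illustrative only: the input format, the function name, and both thresholds are assumptions. In particular, the cut-off of PageRank 7 is borrowed from the highest PageRank group discussed in Section 4.2, and the text does not prescribe it.

```python
from collections import Counter


def remediation_candidates(results, pagerank, min_queries=2, min_pagerank=7):
    """Pick domains worth cleaning up first.

    results  -- dict mapping each drug query to the infected domains
                found in its search results
    pagerank -- dict mapping domain to PageRank (missing means 0)
    A domain qualifies if it shows up for at least `min_queries` queries
    and is either a .edu domain or has a high PageRank.
    """
    counts = Counter()
    for domains in results.values():
        counts.update(set(domains))  # count each domain once per query
    picked = [
        domain for domain, n in counts.items()
        if n >= min_queries
        and (domain.endswith(".edu") or pagerank.get(domain, 0) >= min_pagerank)
    ]
    # Most frequently seen first; ties broken alphabetically.
    return sorted(picked, key=lambda d: (-counts[d], d))


# Invented example data.
results = {
    "q1": ["a.edu", "b.com", "c.com"],
    "q2": ["a.edu", "c.com"],
    "q3": ["c.com", "d.org"],
}
pagerank = {"b.com": 8, "c.com": 7, "d.org": 3}
targets = remediation_candidates(results, pagerank)
```

Here b.com is skipped despite its high PageRank because it appears for only one query.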


Search-engine intervention. In the absence of direct 
law enforcement involvement in remediating source in- 
fections, search engines could play a more active role in 
detecting search-redirection attacks and blocking them 
from search results. Google already blocks websites that 
are known to be distributing malware [40], and recently 
began including warnings on websites believed to be 
compromised. From anecdotal inspection, several source 
websites participating in search-redirection attacks now 
carry the warning. Users are still free to visit the compro- 
mised website, however, so those seeking to buy drugs 
without a prescription may still find willing sellers. We 
encourage search engines to consider dropping such
results altogether, given the illegal activity that is being
directly facilitated.


Disrupting the redirection network. The high degree
of interconnection of the different sites we observed in 
Section 4 suggests that monetary profits come from fun- 
neling traffic between different affiliates. One can thus 
conjecture that disrupting the connectivity of the net- 
work we observed would have adverse economic conse- 
quences for the miscreants. Can this be easily achieved? 

As described in Section 4, while the network of phar- 
macies, sources, and redirectors is almost completely in- 
terconnected, there is a comparatively small number of 
nodes in the network that redirect traffic from one host 
to the next and play a central role in the drug trade. 
Specifically, taking down any of 103 redirectors would 
break up the large network of affiliates we observed, and 
could have strong disruptive effects on the profits made 
by advertisers. Of course, we would expect attackers
to quickly move redirectors to different hosts after
take-downs, and in fact we have evidence, over the long
measurement interval we consider, that this sometimes
happens. Nevertheless, the currently long lifetime of
redirectors indicates that defenders could act more forcefully.

Perhaps even more interestingly, we were able to find 
BGP Autonomous System (AS) information for 84 of 
the 127 redirectors with in- and out-degrees greater than 
two;⁵ of these, 53 (or 63%) belong to one of only 11
distinct ASes.⁶ In other words, a very limited number of
infrastructure providers appear to play an important role 
in the illicit online drug trade. Likewise, we were able 
to identify domain name registrars for 73 of the redirec- 
tor domains; 49 of these domains belong to one of only 
5 registrars (ENOM and GoDaddy, which is expected 
given their market share, but also “A to Z Domains Solu- 
tions,” “BizCN,” and “Directi Internet Solutions,” which 
are far more represented in this sample than their market 
share would warrant). 

Determining whether these hosting providers and reg- 
istrars are willing participants or simply have lax host- 
ing practices is beyond the scope of our investigation. 
However, by strengthening their controls, these service 
providers could probably make it harder to operate
redirectors, thereby yielding tangible benefits in combating
the illicit online drug trade. Should these registrars and
hosting providers take action, we would certainly expect
the miscreants to adapt and move to different providers
(e.g., bulletproof hosting), but it is likely that these
alternative solutions would be more financially costly than
what is currently used, which in turn would reduce the
profit margins miscreants enjoy. In the end, making
illicit online commerce an unattractive economic proposition
could be the strongest deterrent to such activities.

In sum, any subset of source-infection remediation,
search-engine filtering, and redirector take-down would
make it more difficult for miscreants to conduct their
business. Combining these mitigations would likely
cause significant hardship to the criminal networks in
play and would help thwart the illicit online trade of
pharmaceutical drugs (and of other counterfeit goods).

⁵ The remaining 43 redirectors had gone offline when we ran this
experiment in February 2011.

⁶ Many nodes in a given community are hosted on the same AS,
giving additional evidence that the community detection algorithm
discussed in Section 4 is quite accurate.


7 Related work 


The shift observed in the past decade, from Internet and
computer security attacks motivated by fame and reputation
to attacks motivated by financial gain [30], has led
to a number of measurement studies that quantify various
aspects of the problem and motivate possible
intervention policies through quantitative analysis. Due to the
amount of network measurement literature available, we
focus here on work most closely related to this paper.

Many studies, e.g., [7, 22, 24, 50], have focused on 
email spam, describing the magnitude of the problem in 
terms of network resources being consumed, as well as 
some of its salient characteristics. Two key take-away 
points are that spam is a game of very large numbers, 
and that it is not a very effective technique to advertise
products, as observed conversion rates (fraction of
email spam that eventually results in a sale) are small.
As pointed out earlier, spamming techniques are, however,
evolving and increasing their effectiveness by better
targeting potential customers, as illustrated by the recent
flurry of spam observed in social networks [17].

A very recent paper by Levchenko et al. [24] provides
a thorough investigation of the different actors
participating in spamming campaigns, from the spammers
themselves to the suppliers of illicit goods (luxury items,
software, pharmaceutical drugs, ...). The key difference with
the present study is that Levchenko et al. focus
on businesses advertising by spam, while we look
into search-engine manipulation. The data we gathered
(see Section 4.4) seem to suggest that, so far, the two
sets of miscreants remain relatively disjoint, but that
advertising based on search engine manipulation is on the
rise (see Section 3.3).

Measurement studies of spam have also informed pos- 
sible intervention policies, by identifying some infras- 
tructure weaknesses. For instance, taking down a few 
servers from suspicious Internet Service Providers [9] 
can significantly reduce the overall volume of email 
spam. Infiltration of spam-generating botnets, as sug- 
gested by [39], has also been shown to be effective in 
designing much more accurate spam filtering rules. 

A series of papers by Moore and Clayton [27, 29, 31]
investigates the economics of phishing and offers interesting
insights into the tactics phishers use to evade detection.
A further outcome of this line of research is a set of
recommended intervention techniques to combat phishing,
e.g., applying economic pressure on DNS registrars.
The present paper borrows some of the techniques (use 
of Webalizer data, lifetime computation) used for phish- 
ing measurements, as they apply as well to measurement 
of online pharmacy activity (see Section 3). 

A separate branch of research has focused on eco- 
nomic implications of online crime. Thomas and Mar- 
tin [45], Franklin et al. [16] and Zhuge et al. [51] pas- 
sively monitor the advertised prices of illicit commodi- 
ties exchanged in varied online environments (IRC chan- 
nels and web forums). They estimate the size of the mar- 
kets associated with the exchange of credit card num- 
bers, identity information, email address databases, and 
forged video game credentials. Christin et al. [8] mine 
online forum data to assess the economic impact of a so- 
cial engineering attack pervasive on Japanese-language 
websites, and to identify some of the key characteristics 
of the network of perpetrators behind these scams. 

More closely related to the attack described here, 
Ntoulas et al. [34] measure search engine manipulation 
attacks, and Wang et al. [47] show the connection be- 
tween web and email spam, and online advertisers. 

The medical literature has been preoccupied with il- 
licit online pharmacies for a few years, but has mostly 
looked at smaller data samples, and has solely focused 
on the retail side rather than the entire infrastructure sup- 
porting this commerce. As examples, Henney et al. in- 
vestigated the credentials of 37 online pharmacies [18]. 
Littlejohn et al. [25] focused on a slightly larger sample 
of 275 websites, to primarily inform the socio-economic 
impact of Internet availability on drug abuse. Likewise, 
we are not the first to evidence the existence of adver- 
tising affiliate networks, which have been previously de- 
scribed informally (see, e.g., [44]). 

We believe that the work presented in this paper is the 
first to provide a detailed analysis of search-redirection 
attacks, and to substantiate their use with a quantitative 
analysis of the overall magnitude of the illicit online pre- 
scription drug trade. Further, we obtain both an under- 
standing of the structure of the miscreants’ networks, and 
an idea of the conversion rates they can expect. In that 
respect, our measurements may be a useful starting point 
for a more thorough quantitative economic analysis. 


8 Conclusions and future work 


Given the enormous value of web search, it is no sur- 
prise that miscreants have taken aim at manipulating its 
results. We have presented evidence of systematic com- 
promise of high-ranking websites that have been repro- 
grammed to dynamically redirect to online pharmacies. 
These search-redirection attacks are present in one third
of the search results we collected. The infections persist
for months, 96% of the infected hosts are connected
through redirections, and a few collections of redirectors
are critical to the connection between source infections
and pharmacies. We have also observed that legitimate
businesses are nearly absent from the search results,
having been crowded out by blog and forum spam and
compromised websites. We also offer a conservative estimate of between
0.3% and 3% conversion rate of searches for drugs turn- 
ing into sales, which should motivate the pressing need 
for countermeasures. Fortunately, we are optimistic that 
the criminals behind search-redirection attacks could be 
disrupted with targeted interventions due to the high con- 
centrations we observed empirically. 

In terms of immediate future work, there is nothing in- 
herent to the search-redirection attack suggesting it only 
applies to online pharmacies. Even though counterfeit 
drugs are the most pressing issue to deal with due to their 
inherent danger, other purveyors of black-market goods, 
such as counterfeit software, or luxury goods replicas, 
might also hire affiliates that manipulate search results 
with infected websites for advertising purposes. 

We ran a brief (12 days) pilot experiment to assess how 
search-redirection attacks applied to counterfeit software 
in October 2010. After collecting results from 466 
queries, created using input from Google Adwords Key- 
word Tool, we gathered 328 infected source domains, 
72 redirect domains and 140 domains selling counter- 
feit software. Using the same clustering techniques de- 
scribed earlier in the paper, we discovered two connected 
components dominating the network, each in its own 
way: one component was responsible for 44% of the 
identified infections, and the other was responsible for 
30% of the software-selling sites. 
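
As a rough illustration of how such connected components can be found, redirections can be recorded as edges between domains and grouped with a standard graph traversal. The sketch below is ours and is not the clustering code used in the paper; the function name and the domain names are hypothetical.

```python
from collections import defaultdict


def connected_components(edges):
    """Return the connected components of an undirected graph given as edge pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        # Depth-first traversal collecting every node reachable from `start`.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components


# Hypothetical observations: (source or redirector, next hop).
edges = [
    ("src1.example", "redir.example"),
    ("src2.example", "redir.example"),
    ("redir.example", "shop.example"),
    ("src3.example", "other-shop.example"),
]
components = connected_components(edges)
sizes = sorted(len(c) for c in components)
```

With these toy edges, the three domains that share `redir.example` fall into one component together with it, while `src3.example` and `other-shop.example` form a second, smaller one.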

We also observed a small but substantial (12.5%) over- 
lap in the set of redirection domains with those used 
for online pharmacies. Some redirection domains thus 
provide generic traffic redirection services for different 
types of illicit trade. However, the small overlap is also 
a sign of fragmentation among the different fraudulent 
trading activities. We have begun a longitudinal study 
of all retail operations benefiting from search-redirection 
attacks, in order to better understand the economic rela- 
tionships between advertisers and resellers. 

Systematic monitoring of web search results will 
likely become more important due to the value miscre- 
ants have already identified in manipulating outcomes. 
Indeed, this paper has shown that understanding the 
structure of the attackers’ networks gives defenders a 
strong advantage when devising countermeasures.


A Additional query-sample validation


We have collected two sets of additional search queries 
to compare to our main corpus of 218 terms. First, we 
have derived a query set from an exhaustive list of 9 000
prescription drugs provided by the US Food and Drug
Administration [15]. We ran a single query in the form
of “no prescription [drug name]” and collected the first 
64 results for each drug in the list. We executed the 9 000 
queries over five days in August 2010. About 2500 of 
the queries returned no search results. Of the queries that 
returned results, we observed redirection in at least one 
of the search results for 4 350 terms. 

For the second list, we inspected summaries of server 
logs for 169 infected websites to identify drug-related 
search terms that redirected to pharmacies. We obtained 
this information from infected web servers running The 
Webalizer, which creates monthly reports, based on
HTTP logs, of how many visitors a website receives, the 
most popular pages on the website, and so forth. It is not 
uncommon to leave these reports “world-readable” in a 
standard location on the server, which means that anyone 
can inspect their contents. 

In August 2010, we checked 3 806 infected websites 
for Webalizer, finding it accessible on 169 websites. 
We recorded all available data — which usually included 
monthly reports of activity up to and including the cur- 
rent month. One of the individual sub-reports that We- 
balizer creates is a list of search terms that have been 
used to locate the site. Not all Webalizer reports list 
referrer terms, but we found 83 websites that did in- 
clude drug names in the referrer terms for one or more 
months of the log reports. Since we identified the in- 
fected servers running Webalizer by inspecting results of 
the 218 queries from our main corpus, it is unsurpris- 
ing that 98 of these terms appeared in the logs. However, 
the logs also contained an additional 1 179 search queries 
with drug terms. We use these additional search terms as 
an extra queries list to compare against the main corpus. 

We collected the top 64 results for the extra queries list 
daily between October 20 and 31, 2010. When compar- 
ing these results to our main query corpus, we examine 
only the results obtained during this time period, result- 
ing in a significantly smaller number of results than for 
our complete nine-month collection. 

We compare our main list to the additional lists in 
three ways. First, we compare the classification of search 
results for differences in the types of results obtained. 
Second, we compare the distribution of TLD and PageR- 
ank for source infections obtained for both samples. 
Third, we compute the intersection between the domains 
obtained by both sets of queries for source infections, 
redirects and pharmacies. 


The Webalizer: http://www.mrunix.net/webalizer/


                            FDA drug list               Extra query list
                         Drug list     Main list    Extra list     Main list
                       URIs   dom.   URIs   dom.   URIs   dom.   URIs   dom.

Search result classification
Source infections      24.7    4.0   43.7   22.4   35.6   14.0   49.3   27.9
Health resources       12.7    7.4    2.8    3.5    4.9    4.2    2.4    3.0
Legit. pharm.           0.5    0.1   0.03   0.07    0.1    0.1   0.02   0.05
Illicit pharm.          6.7    6.9    8.2   13.6    6.1   11.6    6.5   12.0
Blog/forum spam        25.4   23.7   18.6   17.8   26.3   22.7   17.8   17.7
Uncategorized          30.1   57.9   26.7   42.7   27.2   46.9   24.0   39.4

Source infection TLD breakdown
.com                      60.0          56.9          56.3          54.6
.org                      13.8          17.0          15.4          18.0
.edu                       5.6           8.9           6.2           9.3
.net                       6.1           5.6           5.6           4.6
other                     14.3          11.5          16.5          13.5

Source infection PageRank breakdown
PR 0-3                    47.2          35.0          47.5          41.9
PR 3-6                    41.4          51.3          44.2          46.3
PR >7                     11.4          13.7           8.3          11.8


Table 4: Comparing different lists of search terms to the
main list used in the paper. All numbers are percentages.


Table 4 compares the FDA drugs and extra queries lists 
to the main list. The breakdown of search results for both 
samples is slightly different from what we obtained us- 
ing the main queries. For instance, only 25% of the URIs 
in the FDA results are infections, compared to 44% for 
the main list during the same time period. 13% of the 
results in the FDA drug list point to legitimate health re- 
sources, compared to only 3% of the main sample. This 
is not surprising, given that the drug list often included 
many drugs that are not popular choices for sales by online
pharmacies. Illicit pharmacies appear slightly less
often in the drugs sample (6% vs. 8%), while blog and 
forum spam is more prevalent (25% to 19%). 

The extra queries list follows the FDA list in some 
ways, e.g., more blog infections and fewer source infec- 
tions than results from the corresponding main list. On 
the other hand, the URI breakdown in health resources 
is much closer (4.9% vs. 2.4%). In all samples, the 
number of results that point to legitimate pharmacies is 
very small, though admittedly biggest in the drugs sam- 
ple (0.5% vs. 0.1% for the extra queries). 

We next take a closer look at the characteristics of the 
source infections themselves. The TLD breakdown is 
roughly similar, with a few exceptions. .com is found 
slightly more often in the FDA drugs and extra queries 
results, while .org and .edu appear a bit more often 
in the results for the main sample. The drugs and extra
queries lists tend to have slightly lower PageRank than the
results from the main sample, but the difference is small.


B Estimating the number of sites involved


We also wish to compare the number of attack domains 
that can be identified for different sets of queries. Fig- 
ure 7 compares the overlap between each class of do- 
mains for the different samples. The FDA drugs queries 
identified 1 919 distinct source infections, compared to
1 337 found in the main sample during the same time period.
403 infected domains appeared in both lists.


Figure 7: Comparing the source, redirect and pharmacy
domains observed for different query lists.

It is unreasonable to expect any single query list to be 
comprehensive and identify all attack websites. In both 
of our test cases, we compared much larger query cor- 
pora to a smaller list (6500 and 1 179 versus 218). De- 
spite this, in each case many domains were found exclu- 
sively in the results of the smaller main sample. This is a 
common outcome when trying to measure online attacks 
such as phishing websites [28]. 

Given the difficulty in getting a truly comprehensive 
query list, one alternative is to estimate the total number 
of affected domains to get a better sense of an attack’s 
impact. We apply capture-recapture analysis [21] based 
on our incomplete samples to get an estimate of the mag- 
nitude of the activity studied in this paper. 

Capture-recapture analysis uses repeated sampling to 
estimate populations. In its simplest form, a sample $S_1$
is taken, then replaced into the population. A second
sample $S_2$ is taken, and the population can be estimated
as

$$P = \frac{|S_1| \times |S_2|}{|S_1 \cap S_2|}.$$
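
As a worked illustration, this two-sample estimator (the product of the two sample sizes divided by the size of their overlap) can be applied to the source-infection counts quoted above for the FDA drug list and the main list: 1919 and 1 337 domains, with 403 in common. The sketch below is ours rather than the authors' code, and the function name `lincoln_petersen` is simply a label we chose for this simplest form of capture-recapture.

```python
def lincoln_petersen(n1, n2, overlap):
    """Two-sample capture-recapture estimate: |S1| * |S2| / |S1 ∩ S2|."""
    if overlap <= 0:
        raise ValueError("the two samples must share at least one element")
    return n1 * n2 / overlap


# Source-infection counts quoted in the text (FDA drug list vs. main list).
estimate = lincoln_petersen(1919, 1337, 403)
print(round(estimate))  # 6367
```

The estimate grows as the overlap shrinks: two large samples that share few domains suggest that each has seen only a small part of the population.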


For the capture-recapture model to be perfectly accu- 
rate, a number of assumptions must apply. Notably, the 
population must be homogeneous and closed (i.e., no 
new entries). These assumptions do not entirely hold 
for our analysis: some websites are more likely to ap- 
pear in the search results than others, and websites can 
be added and removed frequently. Nonetheless, we have 
computed the capture-recapture estimate in order to get 
a first approximation of the greater population size. The 
results are given in Figure 7. Notably, the estimates for
source infections and redirects generated by comparing
the different samples are fairly close. Both predict
the true number of redirects to be near 500, and the number
of source infections to be around 5000-6000. The
estimates for the number of pharmacies are more divergent,
with one predicting a population size of 2523 and
the other predicting 795.


Abstract


We perform an in-depth study of SEO attacks that 
spread malware by poisoning search results for popular 
queries. Such attacks, although recent, appear to be both 
widespread and effective. They compromise legitimate 
Web sites and generate a large number of fake pages 
targeting trendy keywords. We first dissect one exam- 
ple attack that affects over 5,000 Web domains and at- 
tracts over 81,000 user visits. Further, we develop de- 
SEO, a system that automatically detects these attacks. 
Using large datasets with hundreds of billions of URLs, 
deSEO successfully identifies multiple malicious SEO 
campaigns. In particular, applying the URL signatures 
derived from deSEO, we find 36% of sampled searches 
to Google and Bing contain at least one malicious link in 
the top results at the time of our experiment. 


1 Introduction 


The spread of malware through the Internet has increased 
dramatically over the past few years. Along with tradi- 
tional techniques for spreading malware (such as through 
links or attachments in spam emails), attackers are con- 
stantly devising newer and more sophisticated methods 
to infect users. A technique that has been gaining preva- 
lence of late is the use of search engines as a medium for 
distributing malware. By gaming the ranking algorithms 
used by search engines through search engine optimiza- 
tion (SEO) techniques, attackers are able to poison the 
search results for popular terms so that these results in- 
clude links to malicious pages. 

A recent study reported that 22.4% of Google searches 
contain such links in the top 100 results [23]. Furthermore,
it has been estimated that, for over 50% of popular keyword
searches (such as queries in Google Trends [9] or
for trending topics on Twitter [20]), the very first page of
results contains at least one link to a malicious page [19].

Using search engines is attractive to attackers because 
of its low cost and its legitimate appearance. Malicious 
pages are typically hosted on compromised Web servers, 
which are effectively free resources for the attackers. As 
long as these malicious pages look relevant to search en- 
gines, they will be indexed and presented to end users. 
Additionally, users usually trust search engines and often
click on search results without hesitation, whereas they
would be wary of clicking on links that appear in unsolicited
spam emails. It is therefore not surprising that,
despite being a relatively new form of attack, search- 
result poisoning is already a huge phenomenon and has 
affected major search engines. 

In this paper, we aim to uncover the mechanics of such 
attacks and answer questions such as how attackers com- 
promise a large number of Web sites, how they auto- 
matically generate content that looks relevant to search 
engines, and how they promote their malicious pages to 
appear at the top of the search results. 

In order to answer these questions, we examine a live, 
large-scale search poisoning attack and study the meth- 
ods used by the attackers. This attack employs over 
5,000 compromised Web sites and poisons more than 
20,000 popular search terms over the course of several 
months. We investigate the files and scripts that attack- 
ers put up on these compromised servers and reverse- 
engineer how the malicious pages were generated. 

Our study suggests that there are two important re- 
quirements for a search-result poisoning attack to be suc- 
cessful: the use of multiple (trendy) keywords and the 
automatic generation of relevant content across a large 
number of pages. Since trendy keywords are often pop- 
ular search terms, poisoning their search results can af- 
fect a large user population. Further, by generating many 
fake pages targeting different keywords, attackers can ef- 
fectively increase their attack coverage. 

Based on these observations, we develop techniques 
to automatically detect search-result poisoning attacks. 
Although there exist methods for identifying malicious 
content in individual Web pages [14, 22], these solutions 
are not scalable when applied to tens of billions of Web 
pages. Further, attackers can leverage cloaking tech- 
niques to display different content based on who is re- 
questing the page—malicious content to real users and 
benign, search-engine-optimized content to search en- 
gine crawlers. Therefore, instead of detecting individual 
SEO pages, we identify groups of suspicious URLs— 
typically containing multiple trendy keywords in each 
URL and exhibiting patterns that deviate from other 
URLs in the same domain. This approach not only is 
more robust than examining individual URLs, but also 
can help identify malicious pages without crawling and 
evaluating their actual contents. 
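
To make the intuition concrete, the sketch below groups URLs by domain and keeps the groups in which several URLs each contain multiple trendy keywords. It illustrates the idea only and is not the deSEO algorithm: the keyword set, the two thresholds, and the function names are all assumptions made for this example.

```python
import re
from collections import defaultdict
from urllib.parse import urlsplit


def keyword_hits(url, trendy):
    """Count distinct trendy keywords appearing in the URL path and query."""
    parts = urlsplit(url)
    text = (parts.path + " " + parts.query).lower()
    tokens = set(re.split(r"[^a-z0-9]+", text))
    return len(tokens & trendy)


def suspicious_groups(urls, trendy, min_keywords=2, min_urls=2):
    """Group keyword-laden URLs by domain; keep domains with enough of them."""
    groups = defaultdict(list)
    for url in urls:
        if keyword_hits(url, trendy) >= min_keywords:
            groups[urlsplit(url).netloc].append(url)
    return {d: us for d, us in groups.items() if len(us) >= min_urls}


# Hypothetical keyword list (lowercase, single tokens) and URLs.
trendy = {"oscars", "winners", "earthquake", "news"}
urls = [
    "http://shop.example/images/page.php?page=oscars+winners",
    "http://shop.example/images/page.php?page=earthquake+news",
    "http://shop.example/product.php?id=7",
    "http://blog.example/post/oscars",
]
flagged = suspicious_groups(urls, trendy)
```

Here only `shop.example` is flagged: two of its URLs carry two trendy keywords each, while the ordinary product page and the single-keyword blog post fall below the assumed thresholds.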

Using this approach, we build deSEO, a system
that automatically detects search-result poisoning attacks
without crawling the contents of Web pages. We apply
deSEO to two datasets containing hundreds of billions of 
URLs collected at different periods from Bing. Our key 
results are: 


1. deSEO detects multiple groups of malicious URLs, 
with each malicious group corresponding to an SEO 
campaign affecting thousands of URLs. 


2. deSEO is able to detect SEO campaigns that employ 
sophisticated techniques such as cloaking and have 
varying link structures. 


3. We derive regular expression signatures from de- 
tected malicious URL groups and apply them to 
search results on Google and Bing. The signatures 
detect malicious links in the results of 36% of the
searches. At the time of our experiments, these links
were not blocked by either the Google Safebrows- 
ing API or Internet Explorer. 


The rest of the paper is structured as follows. We begin
by describing the background for SEO attacks and
reviewing related work in Section 2. Next, we investigate 
a large scale attack in detail in Section 3. Based on the 
insights gained from the attack analysis, we present the 
deSEO detection system in Section 4. In Section 5, we 
apply deSEO to large datasets and report the results. We 
analyze the detected SEO groups and apply the derived 
signatures to filter search results in Section 6. Finally, we 
conclude in Section 7. 


2 Background and Related Work 


Search engines index billions of pages on the Web. Many 
modern search engines use variants of the PageRank algorithm
[17] to rank the Web pages in their search indexes.
The rank of a page depends on the number of incoming 
links, and also on the ranks of the pages where the links 
are seen. Intuitively, the page rank represents the likeli- 
hood that a user randomly clicking on links will end up 
at that page. 
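
The random-surfer intuition can be illustrated with a small power iteration. This is a textbook sketch of basic PageRank, not the variant used by any particular search engine; the damping factor of 0.85, the iteration count, and the three-page link graph are assumptions chosen for illustration. Every linked page is assumed to appear as a key in `links`.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Basic PageRank by power iteration over an adjacency list."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small share from random jumps.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical graph: "a" and "b" both link to "c"; "c" links back to "a".
ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
```

In this toy graph, "c" ends up with the highest rank because it has two incoming links, "a" follows because its single incoming link comes from the highly ranked "c", and "b", which nobody links to, receives only the random-jump share.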

In addition to the rank of the page, search engines 
also use features on the page to determine its relevance 
to queries. In order to prevent spammers from gam- 
ing the system, search engines do not officially disclose 
the exact features used to determine the rank and rele- 
vance. However, researchers estimate that over 200 fea- 
tures are used [3,6]. Among these features, the most 
widely known ones are the words in the title, the URL, 
and the content of the page. The words in the title and 
in the URL are given high weight because they usually 
summarize the content of the page. 

Search Engine Optimization (SEO) is the process of
optimizing Web pages so that they are ranked higher by
search engines. SEO techniques can be classified as being
white-hat or black-hat.

In white-hat SEO, the sites are created primarily with
the end-user in mind, but structured so that search engine 
crawlers can easily navigate the site. Some of the white- 
hat techniques are creating a sitemap, having appropriate 
headings and subheadings, etc. They follow the quality 
guidelines recommended by search engines [8, 29]. 


Black-hat SEO techniques, on the other hand, try to 
game the rankings, and do not follow the search engine 
guidelines. Keyword stuffing (filling the page with lots 
of irrelevant keywords), hidden text and links, cloak- 
ing (providing different content to crawlers and users), 
redirects, and participating in link farms are considered 
black-hat techniques. These practices are frowned upon 
by the search engines, and if a site is caught using such 
techniques, it could be removed from the search index. 


To detect black-hat SEO pages, many approaches have 
been proposed. Some are based on the content of the 
pages [15,5,21], some are based on the presence of 
cloaking [25,27], while some others are based on the 
link structure leading to the pages [26,4]. 


The SEO attacks that we study in this paper are differ- 
ent from traditional ones in that attackers leverage a large 
number of compromised servers. Since these servers 
were originally legitimate and their main sites still op- 
erate normally even after compromise, they display a 
mixed behavior and therefore are harder to detect. 


Our detection methods make use of URL properties 
to detect malicious pages without necessarily crawling 
the pages. In this respect, our work is similar to pre- 
vious work by Ma et al. [13, 12], where they build a 
binary classifier to identify email spam URLs without 
crawling the corresponding pages. The classifier uses 
training data from spam emails. The SEO attacks we 
study are very new and there are few reports on spe- 
cific instances of such attacks [10]. Therefore, it is dif- 
ficult to get training data that has good coverage. In ad- 
dition, spam URLs have different properties than SEO 
URLs. Many spam domains are new and also change 
DNS servers frequently. Therefore, their system makes 
use of domain-level features such as age of the domain 
and the DNS-server location. Since we deal with com- 
promised domains, there are no such strong features. 


A recent analysis of over 200 million Web pages 
by Google’s malware detection infrastructure discovered 
nearly 11,000 domains that are being used to serve mal- 
ware in the form of FakeAV software [18]. This work 
looks at the prevalence and growth of FakeAV as a means 
for delivering malware. Our work, on the other hand, 
looks at the mechanisms used by the perpetrators to game 
search engines for the effective delivery of this kind of 
malware. By developing methods to detect SEO attacks, 
we also detect a large number of compromised domains, 
but without having to inspect them individually.


Figure 1: An overview of how the attack works. The victim issues a popular query to a search engine (1),
and clicks one of the results, which happens to be a malicious page hosted on a compromised server (2). The
compromised server forwards the request to a redirection server (3). The redirection server picks an exploit
server and redirects the victim to it (4). The exploit server tries to exploit the victim’s browser or displays a
scareware page (5) to infect the victim through social engineering.


3 Dissecting an SEO Attack 


In order to gauge the prevalence of search poisoning attacks,
we pick a handful of trendy search terms and issue
queries on Google and Bing. Consistent with previous
findings, we find that the results of around 36%
of the searches contain malicious links (i.e., links
that redirect to pages serving malware), with many of the 
links appearing on the first page of results. 

Figure 1 shows, from a legitimate user’s perspective, 
how a victim typically falls prey to an SEO keyword- 
poisoning attack. The attackers poison popular search 
terms so that their malicious links show up in the search 
results for those terms. When the victim uses a search en- 
gine to search for such popular terms, some of the results 
would point to servers controlled by attackers. These are 
usually legitimate servers that have been compromised 
by the attackers and used to host SEO pages. Clicking 
on the search results leads to an SEO page that redirects, 
after multiple hops, to an exploit server that displays a 
scareware page. For instance, the scareware page might 
depict an anti-virus scan with large flashy warnings of 
multiple infections found on the victim system, scaring 
the user into downloading and installing an “anti-virus” 
program. The exploit servers could also try to directly 
compromise the victim’s browser. 

To understand exactly how these malicious links end 
up highly ranked in the search results for popular queries, 
we pick a few malicious links and examine them closely. 
Our first observation is that the URLs have similar 
structure—they all correspond to php files and the search 
terms being poisoned are present in the URL as argu- 
ments to the php file. The SEO page contains content 
related to the poisoned terms, and also links to URLs of 
a similar format. These URLs point to SEO pages on
other domains that have also been compromised by the
same group of attackers. By crawling these links succes- 
sively till we reach a fixed point, i.e., till we see no more 
new links, we can identify the entire set of domains in- 
volved in a search poisoning attack. 
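
The fixed-point crawl described above amounts to a breadth-first traversal over the links found on SEO pages. The sketch below assumes a caller-supplied function `get_links` (hypothetical) that returns the SEO-style URLs found on a page; here it is backed by an in-memory dictionary of made-up hosts rather than real HTTP requests.

```python
from collections import deque
from urllib.parse import urlsplit


def crawl_to_fixed_point(seeds, get_links):
    """Follow links from the seed URLs until no new URL is discovered."""
    seen = set(seeds)
    queue = deque(seen)
    while queue:
        url = queue.popleft()
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen


def domains(urls):
    """Reduce a collection of URLs to the set of their host names."""
    return {urlsplit(u).netloc for u in urls}


# Toy link graph standing in for fetched SEO pages (hypothetical hosts).
toy_pages = {
    "http://a.example/p.php?q=x": ["http://b.example/p.php?q=y"],
    "http://b.example/p.php?q=y": ["http://c.example/p.php?q=z",
                                   "http://a.example/p.php?q=x"],
    "http://c.example/p.php?q=z": [],
}
found = crawl_to_fixed_point(["http://a.example/p.php?q=x"],
                             lambda u: toy_pages.get(u, []))
print(sorted(domains(found)))  # ['a.example', 'b.example', 'c.example']
```

The `seen` set is what guarantees termination: a URL is queued at most once, so the loop stops as soon as a pass over the queue yields no unseen link.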

In the rest of this section, we study one particular 
SEO attack, which started in August 2010, was active for 
around 10 weeks, and included nearly 37 million SEO 
pages hosted on over 5,000 compromised servers. Ana- 
lyzing the php script that generates the SEO page gives us 
greater insight into the mechanics of this attack. Usually, 
the source of the php files cannot be obtained directly 
since accessing the file causes the Web server to execute 
the php commands and display the output of the execu- 
tion. In this case, however, we found misconfigured Web 
servers that did not execute the files, but instead allowed 
us to download the sources. By examining the source 
files and log files (the locations of the log files were ob- 
tained from the source php file) stored by attackers on 
the Web server, we get a better understanding of the at- 
tack. Note that all the files we examined were publicly 
accessible without the use of any passwords. 

Relying on all of this publicly accessible informa- 
tion, we examine the techniques used by the attackers 
and identify patterns that help detect other similar at- 
tacks. There are three major players in this attack: com- 
promised Web servers, redirection servers, and exploit 
servers. We discuss each of them in detail next. 


3.1 Compromised Web servers 


Finding vulnerable servers 
The servers were likely compromised through a vulnerability
in osCommerce [16], a Web application used to
manage shopping sites. We believe the exploit happened
through osCommerce because all of the compromised
sites were running the software, and the fake pages were
set up under a directory belonging to osCommerce. Additionally,
this software has several known vulnerabilities
that have remained unpatched for several years, so it is
a rather easy target for an attacker. Also, the databases
associated with shopping sites are likely to store sensitive 
information such as mailing addresses and credit card de- 
tails of customers. This offers an additional incentive for 
attackers to target these sites. We believe that vulner- 
able servers running osCommerce are discovered using 
search engines. Attackers craft special queries designed 
to match the content associated with these Web services 
and issue these queries to search engines such as Bing or 
Google to find Web servers running this software [11]. 


Compromising vulnerable servers 

How does the compromise happen? Surprisingly, 
this is the easiest part of the whole operation. The 
primary purpose of compromising the site is to store 
and serve arbitrary files on the server, and to execute 
commands on the server. With vulnerable installs of 
osCommerce, this is as easy as going to a specific URL 
and providing the name of the file to be uploaded. For 
example, if www.example.com/store is the site, 
then visiting www.example.com/store/admin/ 
file_manager.php/login.php?action= 
processuploads and specifying a filename as a 
POST variable will upload the corresponding file to the 
server. 


Hosting malicious content 

Typically, attackers upload php scripts, which allow them 
to execute commands on the compromised machine with 
the privilege of the Web server (e.g., Apache). In many 
cases, attackers upload a graphical shell or a file man- 
ager (also written in php), so that they can easily navi- 
gate the files on the server to find sensitive information. 
The shell includes functions that make it easy for the at- 
tackers to perform activities such as a brute-force attack 
on /etc/passwd, listening on a port on the server, or 
connecting to some remote address. 

In our case, the attacker uploads a simple php script, 
shown in Figure 2. This file is added to the images/ 
folder and is named something inconspicuous, so as to 
not arouse the suspicion of the server administrator. This 
script allows the attacker to either run a php command, 
run a system command, or upload a file to the server. 
A newer version of the script (seen since October 9th, 
2010) additionally allows the attacker to change the per- 
missions of a file. 


<?php 
$e=@$_POST['e']; 
$s=@$_POST['s']; 
if($e) { 
    eval($e); 
} 
if($s) { 
    system($s); 
} 
if($_FILES['f']['name']!='') { 
    move_uploaded_file( 
        $_FILES['f']['tmp_name'], 
        $_FILES['f']['name']); 
} 
?> 

Figure 2: The php script uploaded by the attackers to 
the compromised server. 


Once this script is in place, the attacker can add files 
to the server for setting up fake pages that will be indexed 
by search engines. These files include an html 
template, a CSS style file, an image that looks like a 
YouTube player window, and a php script (usually named 
page.php) that puts all the content together and generates 
an html page using the template. The URLs to the 
pages set up by the attackers are of the form: 
site/images/page.php?page=<keyphrase> 

The set of valid keyphrases is stored in another file 
(key.txt), which is also uploaded by the attackers. 
Most of the keyphrases in the file are obtained from 
Google hot trends [9] and Bing related searches. 

In some other attacks, we observe that the attackers 
make use of cloaking techniques [25,27] while delivering 
malware, i.e., they set up two sets of pages and provide 
non-malicious pages to search engine bots, while serving 
malicious pages to victims. In this specific attack, however, 
the attackers do not use cloaking. Instead, the same 
page is returned to both search engines and regular users, 
and the page makes use of javascript and flash to redirect 
victims to a different page. The redirection is triggered 
by user actions (mouse movement in this case). The ra- 
tionale here is that search engine crawlers typically do 
not generate user actions, so will not know that visitors 
will be redirected to another URL. Using such flash code 
for redirection makes detection much harder. 


The SEO page 

The bulk of the work in creating the SEO page and links 
is done by the page.php script uploaded to the server. 
This is an obfuscated php script, and like many obfus- 
cated scripts, it uses a series of substitution ciphers fol- 
lowed by an eval function to execute the de-obfuscated 
code. By hooking into the eval function in php, we get 
the unobfuscated version. The script performs three ac- 
tivities: 


1. Check if search engine: When the page is requested, 
the script first checks if the request is from 
a search engine crawler. It does this by checking 
the user-agent string against a list of strings used 
by search engines. If the request is from a search 
crawler, the script logs the time of the request, the IP 
address of the requester, the user-agent string, and 
the exact URL requested. Since this attack does not 
use any cloaking, this check seems to be only for 
logging purposes. 


2. Generate links: The script loads the html tem- 
plate, and fills in the title and other headings us- 
ing the keyphrase in the URL. It picks 40 random 
keyphrases from key.txt and generates links to 
the same server using these keyphrases. It then 
picks five other keyphrases from key.txt and 
generates links to five other domains (randomly 
picked from a set of other domains that have also 
been compromised by the attacker). In all, there are 
45 links to similar pages hosted on this and other 
compromised servers. 


3. Generate content: Finally, the script also gen- 
erates content that is relevant to the keyphrase in 
the URL. It does this with the help of search en- 
gines. It queries google.com for the keyphrase, 
and fetches the top 100 results, including the URLs 
and snippets. It also fetches the top 30 images from 
bing.com for the same keyphrase. The script then 
picks a random set of 10 URLs (along with associ- 
ated snippets) and 10 images and merges them to 
generate the content page. 


The content generated for each keyphrase is stored on 
the server in a cached file, and all subsequent requests for 
the page are satisfied from the cache, without having to 
regenerate the content. We believe that the presence of 
highly relevant information on the page, along with the 
dense link structures, both within the site and across 
different compromised sites, increases the pageranks 
of the Web pages generated by the attacker. 


3.2 Redirection servers 


The second component in the attack framework is the 
redirection server, which is responsible for redirecting 
the victim to a server that actually performs the exploit. 
Typically, there are one to three additional layers of redi- 
rection, before the victim reaches the exploit server. In 
our case, when a victim visits the compromised site and 
moves the mouse over the fake YouTube player, he or 
she gets redirected (using javascript) to another compro- 
mised domain, which again performs the redirection. We 
observed two major domains being used for redirection, 
and analyzed the working of the redirection server. 
When the victim reaches the redirection server, it 
queries a service named NailCash to obtain the URL for 
redirection. The NailCash service is accessed via an http 
request to feed2.fancyskirt.com. The redirection 
server provides as arguments an API key, a command, 
and a product ID. In this attack, the redirection server 
picks randomly between two API keys. It specifies the 
command as cmd=getTdsUrl, and the product ID as 
productId=3 (which refers to FakeAV). 


Total | .in | .co.cc | .net | .com 
191   | 16  | 28     | 73   | 74 

Table 1: Breakdown of exploit server TLDs. 

During our observation, the URLs requested were only 
for FakeAV, but it is likely that the same redirection ser- 
vice is used for getting URLs for other types of malware. 

The redirection server caches the received URL for 
10 minutes, and any requests arriving within those 10 
minutes are satisfied from the cache without making a 
request to feed2.fancyskirt.com. Between August 
8th, 2010 and October 13th, 2010 the redirection server 
redirected victims to 453 distinct domains. These do- 
mains were very similar in name, and were all hosted on 
just two /24 IP prefixes. One of them was located in 
Illinois and the other in Amsterdam. 
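
The effect of this cache on what ends up in the redirection server's log can be illustrated with a minimal sketch. The class name RedirectCache, the fetch callback, and the placeholder URL are hypothetical and are not taken from the attackers' code; only the 10-minute lifetime comes from the description above.

```python
CACHE_TTL_SECONDS = 10 * 60  # the 10-minute lifetime described in the text


class RedirectCache:
    """Hypothetical model of a redirector that caches one upstream URL."""

    def __init__(self, fetch):
        self.fetch = fetch        # callable returning a fresh target URL
        self.url = None
        self.fetched_at = None
        self.log = []             # times at which an upstream fetch happened

    def lookup(self, now):
        expired = (self.fetched_at is None
                   or now - self.fetched_at >= CACHE_TTL_SECONDS)
        if expired:
            self.url = self.fetch()
            self.fetched_at = now
            self.log.append(now)  # only cache misses leave a log entry
        return self.url


cache = RedirectCache(fetch=lambda: "http://exploit.example/")
for t in (0, 30, 599, 600, 700):  # request arrival times in seconds
    cache.lookup(t)
logged_times = cache.log
```

In this toy run, five requests arrive but only the two that miss the cache are logged, which is why the log alone undercounts visits.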


3.3 Exploit servers 


Finally, the attacker hosts the actual malicious content on 
an exploit server. We found 191 different domains being 
used by the exploit server over time. All the domains 
were hosted on two IP addresses, one located in Quebec, 
Canada and the other in Luxembourg. The exploit server 
does not display the scareware page if the user agent is 
suspicious (such as a search engine crawler), or if the 
referrer is missing. It also refuses connections from IP 
addresses belonging to search engine companies. Most 
of these domains are under .com, .net, .co.cc, or .in, 
and the breakdown is shown in Table 1. 


3.4 Results and observations 


We present some of the results from our study. Starting 
with one compromised site, we were able to follow the 
links to other compromised sites and eventually map the 
whole network of compromised sites used in this attack. 
In all, we were able to identify 5400 domains, of which 
around 5000 were active, and the others were either down 
or had been cleaned up. 


Link structure 

Figure 3 shows the number of compromised domains 
each site links to. On average, each domain linked to 202 
other domains, with a median value of 159. In addition, 
each compromised domain also linked to around 80,000 
legitimate domains, since each compromised server had 
around 8,000 keyphrases, each corresponding to an SEO 
page, linking to 10 different sites obtained from a Google 
search. The dense link structure helps boost the pageranks 
of these fake pages in the query results. 

Figure 3: The number of other compromised sites 
each site links to. The degree distribution indicates a 
dense linking structure. 

Figure 4: The number of sites compromised by the 
attackers each day over a period of three months. 
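
The figure of 80,000 follows directly from the two approximate counts given in the text. The short calculation below makes the arithmetic explicit; the variable names are illustrative, and the product is best read as an upper bound on distinct legitimate domains, since different SEO pages may link to the same site.

```python
keyphrases_per_server = 8_000   # approximate, from the text
links_per_seo_page = 10         # search results embedded in each SEO page

# One SEO page per keyphrase, each linking to 10 sites from a Google search.
outgoing_links_per_server = keyphrases_per_server * links_per_seo_page
```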


Timeline of compromise 

Figure 4 shows the number of sites compromised on each 
day. We define the time of compromise as the time at 
which the malicious php files were added to the server. 
This time is obtained from the directory listing on the 
server. We find the compromise volume to be rather 
bursty, with most of the servers getting compromised in 
the initial phase of the attack. 

Once the sites are compromised and set up to serve 
the fake pages, we look at how soon the first visit from 
a search engine crawler appears. 

In Figure 5, we see that almost half of the compro- 
mised sites are crawled within four hours of compromise, 
and nearly 85% of the sites are crawled within a day. This 
could either be because search engine crawlers are very 
aggressive at crawling new links, or because the attackers 
are submitting their sites actively to the search engines 
through the Webmaster tools associated with each search 
engine. The dense link structure might also account for 
the quick crawling time of these pages. 


Distribution of keyphrases 
Each compromised server sets up an SEO page for each 
of the keyphrases present in its keyphrase file. Across all the 
compromised sites, we found 38 different keyphrase files, 
with a total of 20,145 distinct keyphrases. 

Figure 5: The interval between a site getting compromised 
and the SEO page getting crawled by a search 
engine. 

Figure 6: The frequency with which each keyphrase 
occurs across the compromised sites. 

Figure 6 plots the distribution of the keyphrases across 
the 38 files. The most popular phrases appear in 37 of 
the 38 files, while nearly 15% of the phrases appear in 
only a single file. In the median case, each phrase is seen 
in 11 different files. To check whether Google Trends is 
one of the sources of these keyphrases, we consider all 
the keywords that were listed on Google Trends over a 
four-month period between May 28th, 2010 and September 
27th, 2010. Out of the 2,125 distinct trend phrases 
in this period, 2,018 (about 95%) were keyphrases used by 
the attackers. Exploiting trendy keywords is thus another 
technique that search poisoning attacks use to increase 
content relevance. 
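
A comparison of this kind amounts to a set intersection. The sketch below is one way to compute it, assuming phrases are compared after lowercasing and whitespace normalization (the normalization rule and the function names are assumptions, not the authors' procedure). The last line checks the percentage quoted above.

```python
def normalize(phrase):
    """Lowercase a phrase and collapse runs of whitespace (assumed rule)."""
    return " ".join(phrase.lower().split())


def trend_coverage(trend_phrases, attacker_keyphrases):
    """Return (matched count, fraction) of trend phrases used by attackers."""
    trends = {normalize(p) for p in trend_phrases}
    used = {normalize(p) for p in attacker_keyphrases}
    matched = trends & used
    return len(matched), len(matched) / len(trends)


# Counts reported in the text: 2,018 of 2,125 trend phrases were reused.
reported_fraction = 2018 / 2125
```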


Traffic from victims 

This was a large scale attack exploiting over 5,000 com- 
promised sites, each hosting close to 8,000 SEO pages— 
for a total of over 40 million SEO pages. However, this 
does not tell us how successful the attack actually was. 
We would need to take into account what fraction of 
these pages were indexed by search engines, how many 
pages showed up in top search results, and how many 
users clicked on links to these SEO pages. Thus, the 
measure of success of this SEO campaign would be the 
number of victims who were actually shown the fake 
anti-virus page. 

Figure 7: The arrival of requests at the redirection 
server. 

Unfortunately, the SEO pages on the compromised 
sites do not log information about who visited each link 
(unless it is a search engine crawler). However, all the 
SEO pages cause the user to get redirected to one of two 
redirection servers. By monitoring the logs on the redi- 
rection servers, we estimate the number of visits to the 
FakeAV pages. 

We started monitoring the redirection server on Au- 
gust 27th, 2010, and so missed the first 20 days of the at- 
tack. As explained in Section 3.2, the redirection server 
fetches the redirect-URL from the NailCash service, and 
each time it does this, it adds an entry to a log file. How- 
ever, since the URL is cached for 10 minutes, we miss 
any requests which were satisfied by the cached URL of 
the exploitation server. Figure 7 illustrates the situation. 
The solid arrows indicate observed requests (which were 
written to the log on the redirection server). The grey 
area denotes the interval when requests are served from 
the cache, and the dotted arrows denote requests which 
we do not observe because they arrived before the cache 
entry expired. 

In order to estimate the total traffic volume from the 
observed requests, we make the common assumption that 
the requests follow a Poisson arrival process. This implies 
that the inter-arrival times are exponentially distributed 
with mean $\lambda$. Since the exponential distribution 
is memoryless, the time to the next arrival has the same 
distribution at any instant. 

Consider again Figure 7. The first request is observed 
at time $t = T_0$. Since the inter-arrival time is memoryless, 
the expected time to the next event is the same 
whether we start our observation at $t = T_0$ or at any other 
$t$ (including $t = T_0 + T'$). Therefore, we start our observation 
at $t = T_0 + T'$, where $T'$ is the duration for which a 
fetched redirect URL is cached, and the time to the next 
arrival $\delta_1$ is a sample from our exponential distribution. 
Similarly, $\delta_2, \delta_3, \ldots, \delta_n$ are other samples from this 
distribution. The mean is then given by: 

$$\lambda = \frac{1}{n} \sum_{i=1}^{n} \delta_i$$

Figure 8: The estimated number of victims redirected 
to the FakeAV sites on each day. 


This is valid only for homogeneous exponential functions, 
and since Web site visits tend to exhibit diurnal 
patterns, we split the time into chunks of 2 hours, during 
which we assume the inter-arrival distribution to be homogeneous. 
Once we have the mean inter-arrival time $\lambda_i$ 
for time interval $t_i$, we can compute the expected number 
of visits $N_{vis}$ as: 

$$N_{vis} = \sum_{i} \frac{\mathrm{len}(t_i)}{\lambda_i}$$

We use this formula to estimate the number of vis- 
its to the redirection server, and plot the results in Fig- 
ure 8. We observe a peak on September 2nd, and then 
a sudden drop after that for a few days. We believe the 
drop occurred because the redirection server was added 
to browser blacklists. On September 7th, the redirection 
server was moved to another domain, and we start seeing 
traffic again. The redirection servers stopped working on 
October 21st, and that marked the end of the SEO cam- 
paign. During this period, we estimate the total number 
of visits to be 60,248, and by extrapolating this number 
to the start of the campaign (August 7th), we estimate 
that there were over 81,000 victims who were taken to 
FakeAV sites. The large number of visits suggests that 
the attack is quite successful in attracting legitimate user 
populations. 
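
The estimation procedure can be sketched as follows. This is a simplified illustration rather than the authors' implementation: it assumes the log is a sorted list of timestamps in seconds, assigns each sample to the 2-hour chunk in which its observation window starts, and lets chunks without any sample contribute nothing. The function and constant names are hypothetical.

```python
from collections import defaultdict

CACHE_TTL = 600.0     # T' in seconds (10 minutes, from the text)
CHUNK = 2 * 3600.0    # length of each interval t_i (2 hours, from the text)


def estimate_visits(logged_times, cache_ttl=CACHE_TTL, chunk=CHUNK):
    """Estimate total visits from sorted cache-miss timestamps (seconds)."""
    samples = defaultdict(list)
    for prev, nxt in zip(logged_times, logged_times[1:]):
        start = prev + cache_ttl       # observation restarts at T_0 + T'
        delta = nxt - start            # waiting time until the next arrival
        samples[int(start // chunk)].append(delta)

    total = 0.0
    for deltas in samples.values():
        mean_gap = sum(deltas) / len(deltas)   # lambda_i for this chunk
        if mean_gap > 0:
            total += chunk / mean_gap          # len(t_i) / lambda_i
    return total
```

For example, log entries at 0, 660, 1320, and 1980 seconds give three samples of 60 seconds each, so a single 2-hour chunk is estimated to contain 7200 / 60 = 120 visits.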


Multiple Compromises 

Perhaps unsurprisingly, we found that many of these vul- 
nerable servers were compromised multiple times by dif- 
ferent attackers. We speculate that these were multiple 
attackers based on the timestamps of when the files were 
added to the server, and the contents of the files. It is pos- 
sible that the same attacker uploaded multiple files to the 
server at different times, but in many cases we see mul- 
tiple php scripts which offer almost identical functional- 
ity, but are slightly different in structure. Also, since we 
observed the bursty nature of compromises, by looking 
at timestamps of these uploaded files and clustering the 
different sites by this timestamp, we can potentially find 
groups of sites which were compromised by different attackers 
at different times. 
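
One simple way to cluster sites by upload timestamp is to sort them and start a new group whenever the gap to the previous upload exceeds a threshold. The sketch below illustrates this idea; the function name and the max_gap parameter are hypothetical, and the text does not specify how such a clustering would actually be carried out.

```python
def group_by_burst(upload_times, max_gap):
    """Split (site, timestamp) pairs into bursts separated by > max_gap."""
    groups = []
    for site, ts in sorted(upload_times, key=lambda item: item[1]):
        if groups and ts - groups[-1][-1][1] <= max_gap:
            groups[-1].append((site, ts))   # continues the current burst
        else:
            groups.append([(site, ts)])     # gap too large: new burst
    return groups
```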

This observation suggests that attackers share compro- 
mised server infrastructure, and thus detecting these sites 
can effectively help search engines remove a wide class 
of malicious content. 


4 Detection Method 


The previous section shows how search-result poison- 
ing attacks are typically performed. In this section, we 
present our system deSEO for automatically detecting 
such attacks. This task is challenging as it is expensive 
to test the maliciousness of every link on the Web us- 
ing content-based approaches. Even if we test only links 
that contain trendy keywords, it is not straightforward as 
many SEO pages may look legitimate in response to most 
requests; they deliver malicious content only when cer- 
tain environmental requirements are met, e.g., the use of 
a vulnerable browser, the redirection by search engines, 
or user actions such as mouse movements. Without care- 
ful reverse engineering, it is hard to guess the right envi- 
ronment settings needed to obtain the malicious content 
on the page. 

To automatically detect the SEO links, we revisit the 
attack we analyzed in the previous section. Our study 
yields three key observations of why the SEO attack is 
successful: 


1. Generation of pages with relevant content. 


2. Targeting multiple popular search keywords to in- 
crease coverage. 


3. Creating dense link structures to boost pagerank. 


Attackers first need to automatically generate pages 
that look relevant to search engines. In addition, one 
page alone may not be able to bring them many victims, 
so attackers often generate many pages to cover a wide 
range of popular search keywords. To promote these 
pages to the top of the search results, attackers need to 
hijack the reputation of compromised servers and create 
dense link structures to boost pagerank. 

We draw on the first two observations when designing 
our detection method. We do not look at the link struc- 
tures of Web pages because that would require crawling 
and downloading all pages to extract the cross-link in- 
formation. In this paper, we show that studying just the 
structure of URLs works well enough to detect SEO at- 
tacks of this type. 

Further, we observe that SEO links are often set up on 
compromised Web servers. These servers usually change 
their behavior after being hacked: many new links are 
added, usually with different URL structures from the 
old URLs. In addition, since attackers control a large 
number of compromised servers and generate pages using 
scripts, their URL structures are often very similar 
across compromised domains. Therefore, we can recog- 
nize SEO attacks by looking for newly created pages that 
share the same structure on different domains. By doing 
so, we can identify a group of compromised servers con- 
trolled by the same attacker (or the same SEO campaign), 
rather than reasoning about individual servers. 

At a high level, deSEO uses three steps for detection. 
The first step is to identify suspicious Web sites that ex- 
hibit a change in behavior with respect to their own his- 
tory. In the second step, we derive lexical features for 
each suspicious Web site and cluster them. In the last 
step, we perform group analysis to pick out suspicious 
SEO clusters. Next, we explain these three steps in de- 
tail. 


4.1 History-based detection 


In the first step, deSEO identifies suspicious Web sites 
that may have been compromised by attackers. SEO 
pages typically have keywords in the URL because 
search engines take those into consideration when com- 
puting the relevance of pages for a search request [7]. 
So, we study all URLs that contain keywords. Option- 
ally, we could also focus on URLs that contain popular 
search keywords because most SEO attacks aim to poi- 
son these keywords so as to maximize their impact and 
reach many users. 

While it is common for Web sites to have links that 
contain keywords, URLs on compromised servers are all 
newly set up, so their structures are often different from 
historical URLs from the same domains. 

Specifically, for each URL that contains keywords 
delimited by common separators such as + and -, 
we extract the URL prefix before the keywords. 
For example, consider the following URL: 
http://www.askania-fachmaerkte.de/images/news.php?page=lisa+roberts+gillan. 
The keywords in the URL are lisa roberts 
gillan and the URL prefix before the keywords 
is http://www.askania-fachmaerkte.de/images/news.php?page=. 

If the corresponding Web site did not have pages starting 
with the same URL prefix before, we consider the 
appearance of the new URL prefix suspicious and pass 
the URL on to the next step. 
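
The prefix extraction and history check can be sketched as below. This is a minimal illustration under stated assumptions: keywords are taken to be the last run of two or more alphabetic tokens joined by + or - in the path or query, and the history is modeled as a plain set of previously seen prefixes. The names KEYWORDS, split_prefix, and is_suspicious are hypothetical.

```python
import re
from urllib.parse import urlsplit

# Assumed keyword pattern: alphabetic tokens joined by '+' or '-'.
KEYWORDS = re.compile(r"[A-Za-z]+(?:[+-][A-Za-z]+)+")


def split_prefix(url):
    """Return (prefix, keywords), or None if no keyword run is found."""
    parts = urlsplit(url)
    # Search only the path and query so hyphenated hostnames are ignored.
    tail = parts.path + ("?" + parts.query if parts.query else "")
    matches = list(KEYWORDS.finditer(tail))
    if not matches:
        return None
    last = matches[-1]
    prefix = f"{parts.scheme}://{parts.netloc}{tail[:last.start()]}"
    return prefix, re.split(r"[+-]", last.group())


def is_suspicious(url, historical_prefixes):
    """True if the URL has keywords and its prefix was never seen before."""
    result = split_prefix(url)
    return result is not None and result[0] not in historical_prefixes
```

Applied to the example URL above, split_prefix yields the prefix ending in news.php?page= together with the three keywords.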


4.2 Clustering of suspicious domains 


In the second step, deSEO proceeds to cluster URLs so 
that malicious links from the same SEO campaign will 
be grouped together, under the assumption that they are 
generated by the same script. 

Similar to previous URL-based approaches for spam 
detection [13,12], we extract lexical features from URLs. 
Empirically, we select the following features: 


1. String features: separator between keywords, argu- 
ment name, filename, subdirectory name before the 
keywords. 


2. Numerical features: number of arguments in the 
URL, length of arguments, length of filename, 
length of keywords. 


3. Bag of words: keywords. 


In our previous URL example, the separator between 
keywords is “+”, the argument name is page, the filename 
is news.php, the directory before the keywords 
is images, the number of arguments in the URL is one, 
the length of arguments is four, the length of filename 
is nine, and the bag of words is {lisa, roberts, 
gillan}. 
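
A feature extractor along these lines might look as follows. It is a sketch, not the authors' code: it assumes the keywords are the value of the last query argument, and how each length is counted (for instance, whether a leading slash belongs to the filename, or whether keyword length is the total number of characters) is an assumption, so numeric values may differ by a small offset from those quoted above. The query string is split by hand because urllib's parse_qsl would decode + into a space and lose the separator.

```python
import posixpath
import re
from urllib.parse import urlsplit


def lexical_features(url):
    """Extract string, numerical, and bag-of-words features from one URL."""
    parts = urlsplit(url)
    args = [a.split("=", 1) for a in parts.query.split("&") if "=" in a]
    if not args:
        return None                          # sketch handles query URLs only
    arg_name, value = args[-1]               # assumed: keywords come last
    sep_match = re.search(r"[^A-Za-z0-9]", value)
    separator = sep_match.group() if sep_match else ""
    keywords = value.split(separator) if separator else [value]
    directory, filename = posixpath.split(parts.path)
    return {
        "separator": separator,
        "arg_name": arg_name,
        "filename": filename,
        "directory": posixpath.basename(directory),
        "num_args": len(args),
        "arg_len": len(arg_name),
        "filename_len": len(filename),
        "keyword_len": sum(len(k) for k in keywords),
        "bag": set(keywords),
    }
```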

As most malicious URLs created on the same do- 
main have similar structure, we aggregate URL features 
together by domain names and study the similarities 
across domains. Note that we consider sub-domains separately. 
For example, abcd.blogspot.com is considered 
a separate domain because it is possible for a sub-domain 
to get compromised rather than the entire domain 
blogspot.com. When aggregating for string features, 
we take the feature value that covers the most URLs in 
the domain; for numerical features, we take the median; 
and for bags of words, we take the union of bags. 
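
The three aggregation rules translate directly into code. The sketch below assumes each URL has already been reduced to a dictionary of features; the key names are illustrative and not taken from the paper.

```python
from collections import Counter
from statistics import median

STRING_KEYS = ("separator", "arg_name", "filename", "directory")
NUMERIC_KEYS = ("num_args", "arg_len", "filename_len", "keyword_len")


def aggregate(feature_dicts):
    """Combine per-URL feature dicts (non-empty list) into one per domain."""
    domain = {}
    for key in STRING_KEYS:
        # String features: the value that covers the most URLs.
        counts = Counter(f[key] for f in feature_dicts)
        domain[key] = counts.most_common(1)[0][0]
    for key in NUMERIC_KEYS:
        # Numerical features: the median over the domain's URLs.
        domain[key] = median(f[key] for f in feature_dicts)
    # Bags of words: the union of all bags.
    domain["bag"] = set().union(*(f["bag"] for f in feature_dicts))
    return domain
```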

In contrast to previous work that uses URLs for spam 
detection, where a binary classification of URLs is sufficient 
[13,12], our goal is to cluster URLs. We adopt 
the widely used K-means++ method [2]. Initially, we se- 
lect K centroids that are distant from each other. Next 
we apply the K-means algorithm to compute K clusters. 
We select and output clusters that are tight, i.e., having 
low residual sum of squares (the squared distance of each 
data point from the cluster centroid). For the remaining 
data points, we iteratively apply the K-means algorithm 
until no more big clusters (with at least 10 domains) can 
be selected. 

Note that neither the computation of distances be- 
tween data points nor the calculation of the cluster cen- 
troid is straightforward because we have many features 
with some of them being non-numerical values. We 
normalize feature dimensions so that distances fall into 
a weighted high-dimensional space, with the values of 
each dimension ranging from 0 to 1. For string features, 
identical values have a distance of 0 and the distance is 
set to 1 otherwise. For numerical features, we define 
the distance as the difference in numerical values, nor- 
malized by the maximum value seen in the dataset. For 
bags of words features, the distance between two bags 
of words $A = a_1, a_2, \ldots, a_n$ and $B = b_1, b_2, \ldots, b_m$ is 
defined as [...]. When picking a weight for each dimension, 
we give a higher weight (the value 2) to string 
features as it is relatively infrequent for different URLs to 
have identical string features. For all other dimensions, 
we give an equal weight of 1. 
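
A distance function with these per-dimension rules could be sketched as follows. Two choices in it are assumptions rather than details from the text: the per-dimension distances are combined as a weighted Euclidean distance, and the bag-of-words term uses Jaccard distance purely as an assumed stand-in. The key names and the max_values argument (the largest value of each numerical feature in the dataset) are illustrative.

```python
import math

STRING_KEYS = ("separator", "arg_name", "filename", "directory")
NUMERIC_KEYS = ("num_args", "arg_len", "filename_len", "keyword_len")
STRING_WEIGHT = 2.0   # weight given to string features in the text
OTHER_WEIGHT = 1.0    # weight given to all other dimensions


def distance(a, b, max_values):
    """Weighted distance between two per-domain feature dicts."""
    total = 0.0
    for key in STRING_KEYS:
        d = 0.0 if a[key] == b[key] else 1.0
        total += STRING_WEIGHT * d ** 2
    for key in NUMERIC_KEYS:
        largest = max_values[key]
        d = abs(a[key] - b[key]) / largest if largest else 0.0
        total += OTHER_WEIGHT * d ** 2
    union = a["bag"] | b["bag"]
    # ASSUMPTION: Jaccard distance stands in for the bag-of-words term.
    d = 1.0 - len(a["bag"] & b["bag"]) / len(union) if union else 0.0
    total += OTHER_WEIGHT * d ** 2
    return math.sqrt(total)
```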

When computing centroids, we adopt the same 
method we use to aggregate URL features into domain 
features, treating all URLs in a cluster as if they were 
from the same domain. If we find a cluster whose residual 
sum of squares, normalized by the cluster size, is lower 
than the preset threshold, we output the cluster. Empirically, 
we find that the results are not sensitive to either the 
weight selection or the threshold selection, as most malicious 
clusters are very tight in distance and stand out easily. 
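
The tightness test on a candidate cluster can be written in a few lines. The sketch below is generic: dist is any distance function between a member and the centroid, and threshold is the preset value, which the text does not state. Both names are placeholders.

```python
def is_tight(members, centroid, dist, threshold):
    """True if the size-normalized residual sum of squares is below threshold."""
    rss = sum(dist(m, centroid) ** 2 for m in members)
    return rss / len(members) < threshold
```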


4.3 Group analysis 


Finally, we perform group analysis to pick compromised 
domain groups and filter legitimate groups. In the previ- 
ous steps, we leverage the fact that compromised sites 
change behavior after the compromise and their link 
structures are similar. In this step, we leverage another 
important observation, namely that SEO links in one 
campaign share a similar page structure (not just the 
URL structure). 

One way to measure the similarity of two Web page 
structures is to compare their parsed HTML tree struc- 
ture [21]. This approach is heavy-weight because we 
need to implement a complete HTML page parser, de- 
rive the tree representations, and perform tree difference 
computations. For simplicity, we focus on simpler fea- 
tures that are effective at characterizing pages. For in- 
stance, we simply use the number of URLs in each page, 
and we find this feature works well empirically. 

We sample N (set to 100) pages from each group 
and crawl these pages. Then we extract the number of 
URLs per page and build a histogram. Figure 9 plots 
the histogram of the number of URLs of a legitimate 
group, while Figure 10 plots the histogram for a malicious 
group. We can clearly see that pages in the legitimate group 
have widely varying numbers of URLs, whereas the malicious 
group has very similar pages with an almost identical number 
of URLs per page. The small fraction of zero-link pages 
is caused by pages that no longer exist (possibly corresponding 
to compromised Web servers that have since 
been cleaned up). 

We normalize each histogram and compute peaks in 
the normalized histograms. If the histogram has high 
peak values, we output the group as a suspicious SEO 
group and manually check the group. Although here we 
still use manual investigation to pick out the final groups, 
the amount of work is actually small. We show in Section 5 
that deSEO outputs fewer than 20 groups. Therefore, 
a human expert only needs to check several sample 
URLs in each group, rather than reasoning about millions 
of URLs that contain keywords one by one. 

Figure 9: An example legitimate group that has a diverse 
distribution of the number of URLs in each Web page. 
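
The peak of a normalized histogram is the largest share of sampled pages that have the same number of URLs. A minimal sketch, assuming the per-page URL counts for a group have already been collected (the function name is hypothetical):

```python
from collections import Counter


def peak_value(url_counts):
    """Largest fraction of sampled pages sharing one URL count."""
    histogram = Counter(url_counts)
    return max(histogram.values()) / len(url_counts)
```

A group in which 90 of 100 sampled pages each contain 45 URLs has a peak of 0.9, whereas a group whose 100 pages all have different URL counts has a peak of 0.01.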


Finally, for each malicious group, deSEO outputs reg- 
ular expression signatures using the signature generation 
system AutoRE [28]. Since URLs within a group have 
similar features, most groups output only one signature. 
We apply the derived signatures to large search engines 
and are able to capture a broad set of attacks appearing 
in search engine results (see details in Section 6.3). 
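
Applying a signature to a stream of URLs is a plain regular-expression match. The pattern below is hand-written for illustration and models the site/images/page.php?page=<keyphrase> form described in Section 3; it is not an actual AutoRE output, and the names SIGNATURE and matches_signature are hypothetical.

```python
import re

# Hypothetical signature: a php script under images/ whose "page" argument
# carries two or more '+'-separated words.
SIGNATURE = re.compile(
    r"^https?://[^/]+/images/\w+\.php\?page=\w+(?:\+\w+)+$"
)


def matches_signature(url):
    """True if the URL matches the illustrative signature."""
    return SIGNATURE.search(url) is not None
```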


5 Results 


In this section, we describe our datasets, which consist 
of large sets of Web URLs, search engine query logs, 
snapshot of Web content, and trendy keywords. Using 
these datasets, we evaluate the effectiveness of deSEO in 
identifying malicious groups of URLs corresponding to 
different SEO attacks. 


5.1 Dataset 


We collect three sampled sets of URLs from Bing. These 
URLs are sampled from all URLs that the crawler saw 
during the months of June 2010, September 2010, and 
January 2011. Each sampled set of URLs contains over 
a hundred billion URLs. We use the June URLs as a 
historical snapshot and apply deSEO to September and 
January URL sets. 


The second dataset we use is a sampled search query 
log from September 2010 that contains over 1 billion 
query requests. It records information about each query 
such as query terms, clicks, query IP address, cookie, 
and user agent. Because of privacy concerns, cookies 
and user agents are anonymized by hashing. In addition, 
when we look at IP addresses in the log, we focus on 
studying the IP addresses of compromised Web servers, 
rather than individual normal users. 


The trendy keywords we use are obtained from Google 
Trends [9]. We collect daily Google Trends keywords 
from May 28th, 2010 to February 3rd, 2011. Each day 
has 20 popular search terms. 
Figure 10: An example malicious group that has a similar 
number of URLs in each Web page. 


5.2 Detection results 


5.2.1 History-based detection 


We apply the history-based detection to the URLs of 
September and January. Since we have over a hundred 
billion URLs, to reduce the processing overhead, we first 
filter out the top 10,000 Alexa [1] Web sites as we be- 
lieve those servers are relatively well managed and have 
a lower chance of getting compromised. Later, after we 
derive regular expression patterns, we could apply them 
to URLs corresponding to these Web sites to detect malicious 
ones, if any, hosted by these servers. 


        | With trendy keyword | With new structure 
Month   | Domains | URLs      | Domains | URLs 
Sept 10 | 428,430 | 1,481,766 | 136,387 | 366,767 
Jan 11  | 512,617 | 3,255,140 | 211,225 | 1,102,878 

Table 2: History-based URL filtering. 


We extract all URLs on the remaining domains that contain 
trendy keywords. Table 2 shows the results. In 
September, over 1 million URLs have trendy keywords, 
but in January the number jumps to over 3 million, suggesting 
a potential increase in SEO attacks. We next choose URLs 
with new URL prefixes by comparing the URL prefixes 
of September 2010 and January 2011 to those of June 
2010. For URLs that contain new prefixes, we select 
them and pass them to the next step. This step removes 
about two thirds of the URLs. 
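
The fraction removed can be checked against the URL counts in Table 2. The calculation below uses only those counts; the dictionary layout and variable names are illustrative.

```python
# URL counts from Table 2.
table2 = {
    "Sept 10": {"trendy_urls": 1_481_766, "new_structure_urls": 366_767},
    "Jan 11": {"trendy_urls": 3_255_140, "new_structure_urls": 1_102_878},
}

# Share of trendy-keyword URLs dropped because their prefix is not new.
removed = {
    month: 1 - row["new_structure_urls"] / row["trendy_urls"]
    for month, row in table2.items()
}
```

This gives roughly 75% for September and 66% for January.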


5.2.2 Clustering results 


We extract the domain features as described in Section 
4.2 and apply the K-means++ algorithm to cluster these 
domains. We vary the value of K and obtain similar 
results since we apply the K-means++ algorithm iteratively. 
Table 3 shows the results for K=100. (The 
third and fourth columns are explained below.) For both 
months, the clustering algorithm outputs hundreds of 
groups. 

Figure 11: The distribution of peak values: percentage 
of pages sharing the same number of URLs within 
a group. 


5.2.3 Group analysis 


As we have grouped similar Web sites into a small num- 
ber of groups, we use group similarity to distinguish le- 
gitimate groups from malicious ones. We use the URL 
features mentioned in Section 4.3 to filter out obvious 
false-positive groups. 

Figure 11 shows the distribution of the peak value 
among all groups. We can see that there are a small 
number of groups that have high peak values. But most 
groups have small peak values as their pages are diverse. 
We pick a threshold for the peak value of 0.45, which fil- 
ters most legitimate groups, as shown in the third column 
of Table 3. After filtering, fewer than 20 groups remain, 
and we manually go through these groups to pick out malicious 
ones. 


        | Number of groups 
Month   | Total | Above threshold | Malicious 
Sept 10 | 290   | 14              | 9 
Jan 11  | 272   | 16              | 11 

Table 3: Clustering and group analysis results. 


In total, we find 9 malicious groups from the September 
data and 11 groups from the January data. The regular 
expressions derived from the two datasets mostly overlap. 
This shows that there are a relatively small number 
of SEO campaigns, and that they are long-lasting. 
Hence, capturing one signature can be useful to capture 
many compromised sites over time. In total, we capture 
957 unique compromised domains and 15,482 malicious 
URLs in our sampled datasets. 

Figure 12 shows a few derived regular expression
samples. These include expressions that match
the URLs of compromised servers that we study in
Section 3, but also a number of new ones. Note that
some of the regular expressions may look generic,
e.g., .*\/index\.php\/\?\w{4,5}=(\w+(\+\w+)+)$,
which matches malicious URLs like:
http://www.kantana.com/2009/index.php/?bqfb=justin+bieber+breaks+neck.
At first glance, one might think index.php followed by words
would match many legitimate URLs, but it turns out that
it is rare to have "/?" in between. Further, the word
bqfb makes it even clearer that this is an automatically
generated URL.
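
To make the role of the "/?" sequence concrete, the following sketch applies this expression with Python's re module. The expression is the one quoted from Figure 12, written without the redundant "\/" escapes; the benign URL is a hypothetical counterexample, not one from the dataset.

```python
import re

# Expression quoted in the text (slashes need no escaping in Python).
PATTERN = re.compile(r".*/index\.php/\?\w{4,5}=(\w+(\+\w+)+)$")

malicious = ("http://www.kantana.com/2009/index.php/"
             "?bqfb=justin+bieber+breaks+neck")
# Hypothetical benign-looking URL: same script name, but no "/?" after it.
benign = "http://www.example.com/index.php?query=justin+bieber+breaks+neck"

match = PATTERN.match(malicious)
# The first capture group holds the "+"-separated poisoned keywords.
keywords = match.group(1).split("+") if match else []
benign_match = PATTERN.match(benign)
```

The expression accepts the first URL and exposes its keywords through the capture group, while the second URL is rejected solely because "index.php" is followed directly by "?" rather than by "/?".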


6 Attack Analysis 


In this section, we leverage the results produced by de- 
SEO to gain more insights into SEO attacks. First, we 
study a new attack found by deSEO, which has a differ- 
ent link structure than the one we detect in Section 3. 
Second, we study the search engine queries originating 
from the IP addresses of compromised servers, as SEO 
toolkits often query the search engines to generate SEO 
pages. Finally, we apply the derived regular expressions
to live search results to detect a broad set of attacks. 


6.1 Study of new attack 


By examining deSEO’s captured malicious groups, we 
find another SEO attack that uses a different methodol- 
ogy for setting up SEO pages, boosting their page ranks, 
and polluting the search index. We believe that this SEO 
campaign is probably orchestrated by a different group of 
attackers. We now characterize the differences between 
this attack and the attack that we initially studied. 


6.1.1 Link structure 


This attack makes use of two sets of servers—one set that 
hosts SEO pages that redirect to an exploit server, and a 
second set of pointer pages that link to only these SEO 
pages. 

We find 120 pointer pages, all of which are hosted on 
hacked Wordpress [24] blogs. Further analysis shows 
that these are older versions of Wordpress that have vul- 
nerabilities, and attackers use one of these vulnerabilities 
to modify the xmlrpc.php files that are included in 
the Wordpress installation by default. Each pointer page 
contains 500 links to SEO pages hosted on 12 different 
domains. The pointer pages are dynamic in that the set 
of links contained in a page changes each time the page 
is visited, and the set of the 12 domains also changes on 
a daily basis. 

In all, we find SEO pages hosted on 976 domains. 
Similar to the previous attack, the SEO pages contain 
content relevant to the poisoned terms and redirect users 
to the exploit server. Unlike in the previous attack, however,
the SEO pages initially did not link to each other. Instead, they
relied on incoming links from the pointer pages to boost
their page ranks, as well as to populate the search engine
index with new SEO pages. Starting in January 2011, the
SEO pages also began linking to each other. This change
suggests that the attackers are
constantly trying to improve the ranking of their pages
using different strategies.


Regex:   .*\/images\/\w+(-\w+)+\.html
Example: http://usedcarsdotcom.com.au/images/eddie-fisher.html

Regex:   .*\/image\/page\.php\?page=\w+(\+\w+)+
Example: http://www.rawstrokes.com/cart/images/page.php?page=justin+bieber+hates+korea&check=dd35923778116c82bc9c5b102ea9e260

Regex:   .*\/robots.txt\/\?showc=\w+(\+\w+)+$
Example: http://www.soundsonshellac.com/robots.txt/?showc=tour+de+france+stage+3

Regex:   .*\/xmlrpc\.php\/\?showc=\w+(\+\w+)+$
Example: http://randomlyinsaneadventures.com/xmlrpc.php/?showc=sec+media+days+2010

Regex:   .*\/images\/watch\/index\.php\?q=(\w+(\+\w+)+)$
Example: http://www.po-kwong.com/product/images/watch/index.php?q=justin+moore

Regex:   .*\/[a-z]{4,5}.php\?[a-z]{3,5}=\w+(\+\w+)+$

Regex:   .*\/[a-z]{3,7}\.php\?[a-z]{1,7}=(\w+(%20\w+)+)$

Regex:   .*\/index\.php\/\?\w{4,5}=(\w+(\+\w+)+)$
Example: http://www.kantana.com/2009/index.php/?bqfb=justin+bieber+breaks+neck

Figure 12: Examples of derived regular expressions.


6.1.2 Use of cloaking 


This attack makes use of cloaking, both for the pointer 
pages and the SEO pages. When a pointer page is ac- 
cessed by a legitimate user, the original page content is 
displayed, but if the page is accessed by a search-engine 
crawler (identified by the user-agent string), then a page 
containing links to the SEO pages is displayed. 

The SEO pages behave differently depending on how 
they are accessed. When accessed by a search engine 
crawler, the page displayed is optimized for the poisoned 
keywords. When a regular user accesses the page, he/she 
is redirected to the exploit server, provided the referrer 
field matches a known search engine and the user-agent 
field indicates a Windows machine. In all other cases, 
the SEO page redirects to a benign page. 
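
The observed behavior of an SEO page amounts to a small decision table, which can be useful when deciding which header combinations to test while probing a suspected page. The sketch below is only a model of the behavior described above: the crawler tokens and search-engine referrer substrings are assumptions chosen for illustration, and the function name is hypothetical.

```python
# Assumed substrings; the actual lists used by the attackers are not known.
CRAWLER_TOKENS = ("googlebot", "bingbot", "msnbot")
SEARCH_ENGINE_REFERRERS = ("google.", "bing.", "yahoo.")


def observed_seo_page_behavior(user_agent, referrer):
    """Model which response a cloaked SEO page was observed to give."""
    ua = user_agent.lower()
    ref = (referrer or "").lower()
    if any(token in ua for token in CRAWLER_TOKENS):
        return "keyword-optimized page"
    came_from_search = any(s in ref for s in SEARCH_ENGINE_REFERRERS)
    if came_from_search and "windows" in ua:
        return "redirect to exploit server"
    return "redirect to benign page"


windows_ua = ("Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; "
              "rv:1.9.2) Gecko/20100115 Firefox/3.6")
crawler_ua = "Mozilla/5.0 (compatible; Googlebot/2.1)"
search_ref = "http://www.google.com/search?q=example"
```

Under this model, a probe must present both a search-engine referrer and a Windows user-agent to observe the redirection chain; any other combination sees either the optimized page or a benign one.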


6.1.3 Redirection and exploit infrastructure 


This attack makes use of a completely different set of 
redirection and exploit servers, though the FakeAV page 
displayed at the end is almost identical. In comparison 
with the previous attack, these SEO pages go through an 
extra level of redirection for reaching the exploit server. 
We find a total of 485 exploit domains, hosted on two 
sets of IP addresses in the US (in Texas and New Jersey).


6.2 Queries from compromised servers


We use the Bing search log to check whether the compromised
servers captured by deSEO issue search queries.
Queries from Web servers can be viewed as a signal of
potential compromise, as they could indicate search engine
scraping activities in order to generate content for
SEO pages. Fewer than 5% of the top 500 Alexa Web sites
submitted any queries during the month of September
2010, while 46% of the compromised servers did. Note
that this does not necessarily mean the remaining ones
do not issue queries to search engines. They could have
been inactive during that month, or could have chosen to
use other search engines.


The queries from legitimate sites are mostly infrequent 
(less than one per day). These may be generated by 
human administrators, who are logged in on the Web 
servers. Only around 1% of legitimate sites generated 
a large number of queries. These queries went through 
the affiliate program that partners with the search engine 
company to provide search results. 


Queries from the IPs of compromised servers are
more frequent than those of legitimate sites. In addition,
queries from the same group have similar behavior.
Often, they present the same user-agent string, e.g.,
“Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2) Gecko/20100115 Firefox/3.6”.
Relying on this user-agent string, together with trendy 
keywords, we detect other IP addresses that share the 
same pattern. Accurately determining the compromised 
domains hosted on these IP addresses is challenging, 
though, because compromised servers are usually small 
Web servers hosted on hosting infrastructures. It is 
common to have many domains (sometimes tens of 
thousands) sharing the same IP address. Therefore, 
seeing bad activities from an IP address is not sufficient 
to pinpoint the exact compromised server. 


Besides trendy keyword queries, we also identify a
number of other malicious queries from these compromised
servers. For example, there are queries of the form
site:<hosting_site>. Such a query returns all the
pages from a particular site. What is interesting is that
the site specified in the query is also hosted on the same
IP address that issued the query. Such queries are seen
when a site is compromised, and the attackers try to determine
which pages to inject code into; they typically
pick the most popular pages, i.e., the ones that show up
high in the search results.


       | 60 trendy keywords                        | 60 attacker-poisoned keywords
       | # of matched searches | # of matched URLs | # of matched searches | # of matched URLs
Google | 16                    | 39                | 27                    | 124
Bing   | 0                     | 0                 | 1                     | 1

Table 4: Matching Google and Bing search results using derived regular expressions.


6.3 Matching Google and Bing queries 


We apply our derived regular expressions of Sec- 
tion 5.2.3 to the Google and Bing search engines. We use 
two sets of query terms. The first set is a set of 60 trendy
keywords obtained from Google Trends (February 1st to
February 3rd, 2011). The second set is a set of 60 keywords
poisoned by attackers (but not in Google Trends),
which were randomly selected from the keywords ap- 
pearing in captured malicious URLs. For each keyword, 
we manually perform Web search and then extract the top 
100 results returned from Google and Bing. We use only 
60 search terms because the search queries are issued 
manually—automated queries and screen-scraping are 
against the terms of use, and the search results obtained 
using the search APIs are not consistent with the results 
obtained through a browser. (While we do not know why 
the API results differ from the browser-based search re- 
sults, we speculate that the API search results are not as 
fresh, so contain older and more well-established links.) 

Of the 120 keywords in total, 36% yield at
least one malicious link in the top 100 results (which
are spread over ten pages). Table 4 shows the detailed
results. Not surprisingly, attackers are even more successful
in poisoning the non-trendy keywords that they select
(45% match rate). This is because fewer Web pages
may match these keywords and hence it can be easier
for malicious links to appear among the top search results.
Figure 13 shows the distribution of the malicious links, i.e.,
how high in the search results they are displayed.
We can see that the malicious links are spread over all
of the top 10 pages. Similar to previous reports [19], we
find that Bing's top search results contain relatively fewer
malicious links. (Experiments were conducted in February
2011; the search results of both Google and Bing have
been improved since then.)
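
As a quick check on these figures, the quoted rates can be recomputed from Table 4. The sketch below assumes that the 36% and 45% rates are derived from the Google row alone, which is the reading that reproduces both numbers; the variable names are ours.

```python
# Matched searches from the Google row of Table 4; each set has 60 keywords.
KEYWORDS_PER_SET = 60
matched_searches = {"trendy": 16, "attacker-poisoned": 27}

# Per-set match rate: matched searches divided by keywords in the set.
rate = {name: hits / KEYWORDS_PER_SET
        for name, hits in matched_searches.items()}

# Overall rate across both sets (43 matched searches out of 120 keywords).
overall = sum(matched_searches.values()) / (KEYWORDS_PER_SET * len(matched_searches))
```

This gives 16/60 (about 27%) for trendy keywords, 27/60 = 45% for attacker-selected keywords, and 43/120 (about 36%) overall.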

We manually verified all the matches and did not find 
false positives. All of the matches are generated by only 
two regular expressions—the first and the seventh in Fig- 
ure 12. 

We run all matching URLs through Firefox using the
Google Safebrowsing API and Internet Explorer (using
its internal blacklist), and none of them were blocked by
either browser at the time of the experiment. This result
indicates that deSEO is able to capture live attacks that
have not yet been reported.


Figure 13: The number of malicious links found in
different pages of the search results for 60 popular
keywords.


7 Discussion and Conclusion 


In this paper, we study a large-scale, live search-result 
poisoning attack that leverages SEO techniques. Based 
on our observations, we develop a system called de- 
SEO that automatically detects additional malicious SEO 
campaigns. By deriving URL signatures from our results 
and applying them to both Google and Bing, we find 36% 
of searches yield links to malicious pages among their 
top results. Our paper appears to be the first to present 
a systematic study of search-result poisoning attacks and 
how to detect them. 

Attackers may wish to evade deSEO detection by not 
embedding keywords in URLs. However, this approach 
reduces the chance of getting SEO links to the top search 
results, because keywords in URLs appear to be an im- 
portant feature for relevance computation. Also, it may 
reduce the chance of clicks by end users, as URLs with 
keywords look more relevant to users who search using 
these keywords. Attackers may also wish to diversify 
the SEO link structures so that they look different across 
different domains. Our history-based detection will still
pick up such SEO links as long as their URL structures
differ from those used previously by the same domains.
To further detect the diversified SEO links as a
group, we could alternatively adopt content-based solu- 
tions by comparing their page similarity [21], possibly 
on virtual machines [22]. 

In our study, we find that attackers usually put up a
large number of new pages after compromising a Web 
site, which is often relatively inactive before compro- 
mise. Therefore, search engines could give a lower rank 
to new pages on previously inactive sites. Search engines 
could also consider dense link structures to identify SEO 
attacks, if they are willing to crawl most of the malicious 
pages and if they can afford to perform offline analysis. 
In addition, we notice that the contents of SEO pages are 
mostly irrelevant to the compromised site’s homepage, 
and sometimes even the language is different. Therefore, 
semantic-based approaches are also promising avenues 
for further investigation. 


Abstract


The fluidity of application markets complicates smartphone
security. Although recent efforts have shed light
on particular security issues, there remains little insight 
into broader security characteristics of smartphone ap- 
plications. This paper seeks to better understand smart- 
phone application security by studying 1,100 popular 
free Android applications. We introduce the ded decom- 
piler, which recovers Android application source code 
directly from its installation image. We design and exe- 
cute a horizontal study of smartphone applications based 
on static analysis of 21 million lines of recovered code. 
Our analysis uncovered pervasive use/misuse of person- 
al/phone identifiers, and deep penetration of advertising 
and analytics networks. However, we did not find ev- 
idence of malware or exploitable vulnerabilities in the 
studied applications. We conclude by considering the 
implications of these preliminary findings and offer di- 
rections for future analysis. 


1 Introduction 


The rapid growth of smartphones has led to a renaissance
for mobile services. Go-anywhere applications
support a wide array of social, financial, and enterprise 
services for any user with a cellular data plan. Appli- 
cation markets such as Apple’s App Store and Google’s 
Android Market provide point and click access to hun- 
dreds of thousands of paid and free applications. Mar- 
kets streamline software marketing, installation, and 
update—therein creating low barriers to bring applica- 
tions to market, and even lower barriers for users to ob- 
tain and use them. 

The fluidity of the markets also presents enormous se- 
curity challenges. Rapidly developed and deployed ap- 
plications [40], coarse permission systems [16], privacy- 
invading behaviors [14, 12, 21], malware [20, 25, 38], 
and limited security models [36, 37, 27] have led to exploitable
phones and applications. Although users seemingly
desire it, markets are not in a position to provide
security in more than a superficial way [30]. The lack of
a common definition for security and the volume of applications
ensure that some malicious, questionable, and
vulnerable applications will find their way to market.

In this paper, we broadly characterize the security of 
applications in the Android Market. In contrast to past 
studies with narrower foci, e.g., [14, 12], we consider a 
breadth of concerns including both dangerous functional- 
ity and vulnerabilities, and apply a wide range of analysis 
techniques. In this, we make two primary contributions: 


• We design and implement a Dalvik decompiler,
ded. ded recovers an application’s Java source
solely from its installation image by inferring lost
types, performing DVM-to-JVM bytecode retargeting,
and translating class and method structures.

• We analyze 21 million LOC retrieved from the top
1,100 free applications in the Android Market using
automated tests and manual inspection. Where possible,
we identify root causes and posit the severity
of discovered vulnerabilities.

Our popularity-focused security analysis provides insight
into the most frequently used applications. Our
findings inform the following broad observations.


1. Similar to past studies, we found wide misuse of 
privacy sensitive information—particularly phone 
identifiers and geographic location. Phone iden- 
tifiers, e.g., IMEI, IMSI, and ICC-ID, were used 
for everything from “cookie-esque” tracking to
account numbers.

2. We found no evidence of telephony misuse, back- 
ground recording of audio or video, abusive connec- 
tions, or harvesting lists of installed applications. 

3. Ad and analytic network libraries are integrated 
with 51% of the applications studied, with Ad Mob 
(appearing in 29.09% of apps) and Google Ads (ap- 
pearing in 18.72% of apps) dominating. Many ap- 
plications include more than one ad library. 

4. Many developers fail to securely use Android APIs.
These failures generally fall into the classification
of insufficient protection of privacy sensitive information.
However, we found no exploitable vulnerabilities
that can lead to malicious control of the phone.


This paper is an initial but not final word on An- 
droid application security. Thus, one should be cir- 
cumspect about any interpretation of the following re- 
sults as a definitive statement about how secure appli- 
cations are today. Rather, we believe these results are 
indicative of the current state, but there remain many 
aspects of the applications that warrant deeper analy- 
sis. We plan to continue with this analysis in the fu- 
ture and have made the decompiler freely available at 
http://siis.cse.psu.edu/ded/ to aid the broader 
security community in understanding Android security. 

The following sections reflect the two thrusts of this 
work: Sections 2 and 3 provide background and detail 
our decompilation process, and Sections 4 and 5 detail 
the application study. The remaining sections discuss our 
limitations and interpret the results. 


2 Background 


Android: Android is an OS designed for smartphones. 
Depicted in Figure 1, Android provides a sandboxed ap- 
plication execution environment. A customized embed- 
ded Linux system interacts with the phone hardware and 
an off-processor cellular radio. The Binder middleware 
and application API run on top of Linux. To simplify,
an application’s only interface to the phone is through 
these APIs. Each application is executed within a Dalvik 
Virtual Machine (DVM) running under a unique UNIX 
uid. The phone comes pre-installed with a selection of 
system applications, e.g., phone dialer, address book. 
Applications interact with each other and the phone 
through different forms of IPC. Intents are typed inter- 
process messages that are directed to particular appli- 
cations or systems services, or broadcast to applications 
subscribing to a particular intent type. Persistent content 
provider data stores are queried through SQL-like inter- 
faces. Background services provide RPC and callback 
interfaces that applications use to trigger actions or ac- 
cess data. Finally user interface activities receive named 
action signals from the system and other applications. 
Binder acts as a mediation point for all IPC. Access 
to system resources (e.g., GPS receivers, text messag- 
ing, phone services, and the Internet), data (e.g., address 
books, email) and IPC is governed by permissions as- 
signed at install time. The permissions requested by the 
application and the permissions required to access the 
application’s interfaces/data are defined in its manifest 
file. To simplify, an application is allowed to access a
resource or interface if the required permission allows
it. Permission assignment (and, indirectly, the security
policy for the phone) is largely delegated to the phone’s
owner: the user is presented a screen listing the permissions
an application requests at install time, which they
can accept or reject.


Figure 1: The Android system architecture


Dalvik Virtual Machine: Android applications are writ- 
ten in Java, but run in the DVM. The DVM and Java byte- 
code run-time environments differ substantially: 


Application Structure. Java applications are composed 
of one or more .class files, one file per class. The JVM 
loads the bytecode for a Java class from the associated 
.class file as it is referenced at run time. Conversely, a 
Dalvik application consists of a single .dex file containing
all application classes.

Figure 2 provides a conceptual view of the compilation
process for DVM applications. After the Java compiler
creates JVM bytecode, the Dalvik dx compiler consumes
the .class files, recompiles them to Dalvik bytecode,
and writes the resulting application into a single
.dex file. This process consists of the translation, reconstruction,
and interpretation of three basic elements of
the application: the constant pools, the class definitions,
and the data segment. A constant pool describes, not surprisingly,
the constants used by a class. This includes,
among other items, references to other classes, method
names, and numerical constants. The class definitions
consist of basic information such as access flags and
class names. The data element contains the method code
executed by the target VM, as well as other information
related to methods (e.g., number of DVM registers used,
local variable table, and operand stack sizes) and to class
and instance variables.


Register architecture. The DVM is register-based, 
whereas existing JVMs are stack-based. Java bytecode 
can assign local variables to a local variable table before 
pushing them onto an operand stack for manipulation by 
opcodes, but it can also just work on the stack without 
explicitly storing variables in the table. Dalvik bytecode 
assigns local variables to any of the 2^16 available registers.
The Dalvik opcodes directly manipulate registers,
rather than accessing elements on a program stack. 


Instruction set. The Dalvik bytecode instruction set
differs substantially from that of Java. Dalvik has 218
opcodes while Java has 200; however, the nature of the
opcodes is very different. For example, Java has tens
of opcodes dedicated to moving elements between the
stack and local variable table. Dalvik instructions tend to
be longer than Java instructions; they often include the
source and destination registers. As a result, Dalvik applications
require fewer instructions. In Dalvik bytecode,
applications have on average 30% fewer instructions than
in Java, but have a 35% larger code size (bytes) [9].


Figure 2: Compilation process for DVM applications


Constant pool structure. Java applications replicate elements
in constant pools within the multiple .class files,
e.g., referrer and referent method names. The dx com- 
piler eliminates much of this replication. Dalvik uses a 
single pool that all classes simultaneously reference. Ad- 
ditionally, dx eliminates some constants by inlining their 
values directly into the bytecode. In practice, integers, 
long integers, and single and double precision floating- 
point elements disappear during this process. 


Control flow structure. Control flow elements such
as loops, switch statements and exception handlers are 
structured differently in Dalvik and Java bytecode. Java 
bytecode structure loosely mirrors the source code, 
whereas Dalvik bytecode does not. 


Ambiguous primitive types. Java bytecode vari- 
able assignments distinguish between integer (int) and 
single-precision floating-point (float) constants and be- 
tween long integer (long) and double-precision floating- 
point (double) constants. However, Dalvik assignments 
(int/float and long/double) use the same opcodes for
integers and floats, i.e., the opcodes are untyped beyond
specifying precision.
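
The ambiguity is easy to see with a concrete bit pattern. The following sketch (our own illustration, using Python's struct module rather than anything from ded) reads the same 32-bit constant once as an integer and once as a single-precision float.

```python
import struct

# One 32-bit constant, as an untyped constant-load opcode would carry it.
raw = 0x3F800000

# Interpretation 1: a 32-bit integer.
as_int = raw

# Interpretation 2: the same four bytes reread as an IEEE 754 float.
(as_float,) = struct.unpack("<f", struct.pack("<I", raw))
```

Here the integer reading is 1065353216 while the float reading is 1.0. A decompiler that emits the wrong one produces source code with a different meaning, which is why the type must be recovered from how the register is later used.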

Null references. The Dalvik bytecode does not specify 
a null type, instead opting to use a zero value constant. 
Thus, constant zero values present in the Dalvik byte- 
code have ambiguous typing that must be recovered. 


Comparison of object references. The Java bytecode
uses typed opcodes for the comparison of object references
(if_acmpeq and if_acmpne) and for null comparison
of object references (ifnull and ifnonnull). The
Dalvik bytecode uses a more simplistic integer comparison
for these purposes: a comparison between two integers,
and a comparison of an integer and zero, respectively.
This requires the decompilation process to recover
types for integer comparisons used in DVM bytecode.


Storage of primitive types in arrays. The Dalvik bytecode
uses ambiguous opcodes to store and retrieve elements
in arrays of primitive types (e.g., aget for int/float
and aget-wide for long/double), whereas the corresponding
Java bytecode is unambiguous. The array
type must be recovered for correct translation.


3 The ded decompiler 


Building a decompiler from DEX to Java for the study 
proved to be surprisingly challenging. On the one hand, 
Java decompilation has been studied since the 1990s— 
tools such as Mocha [5] date back over a decade, with 
many other techniques being developed [39, 32, 31, 4, 
3, 1]. Unfortunately, prior to our work, there existed no 
functional tool for the Dalvik bytecode.¹ Because of the
vast differences between JVM and DVM, simple modifi- 
cation of existing decompilers was not possible. 

Our choice to recover Java source rather than
operate on the DEX opcodes directly was grounded in
two reasons. First, we wanted to leverage existing tools
for code analysis. Second, we required access to source
code to identify false positives resulting from automated
code analysis, i.e., to perform manual confirmation.

ded extraction occurs in three stages: a) retarget- 
ing, b) optimization, and c) decompilation. This sec- 
tion presents the challenges and process of ded, and con- 
cludes with a brief discussion of its validation. Interested 
readers are referred to [35] for a thorough treatment. 


3.1 Application Retargeting 


The initial stage of decompilation retargets the applica- 
tion .dex file to Java classes. Figure 3 overviews this 
process: (1) recovering typing information, (2) translat- 
ing the constant pool, and (3) retargeting the bytecode. 


Type Inference: The first step in retargeting is to iden- 
tify class and method constants and variables. However, 
the Dalvik bytecode does not always provide enough in- 
formation to determine the type of a variable or constant 
from its register declaration. There are two generalized 
cases where variable types are ambiguous: 1) constant 
and variable declaration only specifies the variable width 
(e.g., 32 or 64 bits), but not whether it is a float, integer, 
or null reference; and 2) comparison operators do not 
distinguish between integer and object reference compar- 
ison (i.e., null reference checks). 

Type inference has been widely studied [44]. The seminal
Hindley-Milner [33] algorithm provides the basis for
type inference algorithms used by many languages such
as Haskell and ML. These approaches determine unknown
types by observing how variables are used in operations
with known type operands. Similar techniques
are used by languages with strong type inference, e.g.,
OCAML, as well as weaker inference, e.g., Perl.


Figure 3: Dalvik bytecode retargeting

ded adopts the accepted approach: it infers register 
types by observing how they are used in subsequent op- 
erations with known type operands. Dalvik registers 
loosely correspond to Java variables. Because Dalvik 
bytecode reuses registers whose variables are no longer 
in scope, we must evaluate the register type within its 
context of the method control flow, i.e., inference must 
be path-sensitive. Note further that ded type inference is 
also method-local. Because the types of passed param- 
eters and return values are identified by method signa- 
tures, there is no need to search outside the method. 

There are three ways ded infers a register’s type. First, 
any comparison of a variable or constant with a known 
type exposes the type. Comparison of dissimilar types 
requires type coercion in Java, which is propagated to 
the Dalvik bytecode. Hence legal Dalvik comparisons al- 
ways involve registers of the same type. Second, instruc- 
tions such as add-int only operate on specific types, 
manifestly exposing typing information. Third, instruc- 
tions that pass registers to methods or use a return value 
expose the type via the method signature. 

The ded type inference algorithm proceeds as follows.
After reconstructing the control flow graph, ded identifies
any ambiguous register declaration. For each such
register, ded walks the instructions in the control flow
graph starting from its declaration. Each branch of the
control flow encountered is pushed onto an inference
stack, i.e., ded performs a depth-first search of the control
flow graph looking for type-exposing instructions. If
a type-exposing instruction is encountered, the variable
is labeled and the process is complete for that variable.²
There are three events that cause a branch search to terminate:
a) when the register is reassigned to another variable
(e.g., a new declaration is encountered), b) when a
return instruction is encountered, and c) when an exception
is thrown. After a branch is abandoned, the next branch
is popped off the stack and the search continues. Lastly,
type information is forward propagated, modulo register
reassignment, through the control flow graph from each
register declaration to all subsequent ambiguous uses.
This algorithm resolves all ambiguous primitive types,
except for one isolated case when all paths leading to
a type ambiguous instruction originate with ambiguous
constant instructions (e.g., all paths leading to an integer
comparison originate with registers assigned a constant
zero). In this case, the type does not impact decompilation,
and a default type (e.g., integer) can be assigned.
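
The search described above can be sketched in a few lines. This is a simplified illustration and not ded's implementation: the Instr record, the EXPOSES table, and the sample method are all hypothetical, only two type-exposing opcodes are modeled, and types exposed through comparisons or method signatures are ignored.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical subset of opcodes whose operands have a known type.
EXPOSES = {"add-int": "int", "add-float": "float"}


@dataclass
class Instr:
    op: str
    dest: Optional[str] = None   # register written, if any
    uses: tuple = ()             # registers read
    succs: tuple = ()            # indices of successor instructions


def infer_type(code, decl_index, default="int"):
    """Depth-first search from a declaration for a type-exposing use."""
    reg = code[decl_index].dest
    stack = list(code[decl_index].succs)   # the inference stack
    visited = set()
    while stack:
        i = stack.pop()
        if i in visited:
            continue
        visited.add(i)
        ins = code[i]
        if reg in ins.uses and ins.op in EXPOSES:
            return EXPOSES[ins.op]         # type exposed: done
        if ins.op in ("return", "throw") or ins.dest == reg:
            continue                       # abandon this branch
        stack.extend(ins.succs)
    return default                         # no path exposes a type


code = [
    Instr("const", dest="v0", succs=(1,)),           # 0: ambiguous constant
    Instr("if-eqz", uses=("v1",), succs=(4, 2)),     # 1: two branches
    Instr("const", dest="v0", succs=(3,)),           # 2: v0 reassigned
    Instr("return"),                                 # 3
    Instr("add-float", dest="v2", uses=("v0", "v3"), succs=(5,)),  # 4
    Instr("return"),                                 # 5
]
inferred = infer_type(code, 0)
```

In the sample method, the branch through instruction 2 is abandoned because v0 is reassigned, and the branch through instruction 4 exposes v0 as a float. When no path exposes a type, the sketch falls back to a default, mirroring the isolated case noted in the text.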


Constant Pool Conversion: The .dex and .class file 
constant pools differ in that: a) Dalvik maintains a sin- 
gle constant pool for the application and Java maintains 
one for each class, and b) Dalvik bytecode places primi- 
tive type constants directly in the bytecode, whereas Java 
bytecode uses the constant pool for most references. We 
convert constant pool information in two steps. 

The first step is to identify which constants are needed 
for a .class file. Constants include references to 
classes, methods, and instance variables. ded traverses 
the bytecode for each method in a class, noting such ref- 
erences. ded also identifies all constant primitives. 

Once ded identifies the constants required by a class, 
it adds them to the target .class file. For primitive type 
constants, new entries are created. For class, method, 
and instance variable references, the created Java con- 
stant pool entries are based on the Dalvik constant pool 
entries. The constant pool formats differ in complex- 
ity. Specifically, Dalvik constant pool entries use sig- 
nificantly more references to reduce memory overhead. 
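
The first step, splitting one shared pool into per-class pools, can be illustrated as below. This is a sketch under simplifying assumptions: pool entries are plain strings, indices start at zero, and the input class_refs (which shared-pool indices each class's bytecode references) is hypothetical. Real .class constant pools use typed, cross-referencing entries.

```python
def build_class_pools(shared_pool, class_refs):
    """Build one constant pool per class from a single shared pool.

    class_refs maps a class name to the shared-pool indices that the
    bytecode of its methods references, in order of appearance.
    Returns, per class, the local pool and an old-to-new index map.
    """
    pools = {}
    for cls, refs in class_refs.items():
        local, remap = [], {}
        for idx in refs:
            if idx not in remap:          # add each constant once per class
                remap[idx] = len(local)
                local.append(shared_pool[idx])
        pools[cls] = (local, remap)
    return pools


# Hypothetical shared (Dalvik-style) pool and per-class references.
shared_pool = ["Ljava/lang/String;", "toString", "Lcom/example/A;", "run"]
class_refs = {"A": [0, 1, 0], "B": [3, 0]}
pools = build_class_pools(shared_pool, class_refs)
```

Note that "Ljava/lang/String;" ends up in both per-class pools. This is the replication that the single Dalvik pool avoids and that retargeting to .class files has to reintroduce.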


Method Code Retargeting: The final stage of the re- 
targeting process is the translation of the method code. 
First, we preprocess the bytecode to reorganize structures 
that cannot be directly retargeted. Second, we linearly 
traverse the DVM bytecode and translate to the JVM. 

The preprocessing phase addresses multidimensional 
arrays. Both Dalvik and Java use blocks of bytecode 
instructions to create multidimensional arrays; however, 
the instructions have different semantics and layout. ded 
reorders and annotates the bytecode with array size and 
type information for translation. 

The bytecode translation linearly processes each 
Dalvik instruction. First, ded maps each referenced reg- 
ister to a Java local variable table index. Second, ded 
performs an instruction translation for each encountered 
Dalvik instruction. As Dalvik bytecode is more compact
and takes more arguments, one Dalvik instruction
frequently expands to multiple Java instructions. Third, ded
patches the relative offsets used for branches based on
preprocessing annotations. Finally, ded defines exception
tables that describe try/catch/finally blocks.
The resulting translated code is combined with the constant
pool to create a legal Java .class file.
The following is an example translation for add-int:

    Dalvik              | Java
    add-int d0,s0,s1    | iload s0'
                        | iload s1'
                        | iadd
                        | istore d0'

where ded creates a Java local variable for each register,
i.e., d0 → d0', s0 → s0', etc. The translation creates
four Java instructions: two to push the variables onto the
stack, one to add, and one to pop the result.
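
The expansion of a register-based instruction into stack-based instructions can be sketched in a few lines. The function name, the tuple encoding of instructions, and the register-to-slot mapping are assumptions made for illustration; they do not reflect ded's internal data structures.

```python
def translate_add_int(dest, src_a, src_b, local_index):
    """Expand one Dalvik add-int into four JVM stack instructions.

    local_index maps a Dalvik register name to a Java local variable slot.
    """
    return [
        ("iload", local_index[src_a]),   # push first operand
        ("iload", local_index[src_b]),   # push second operand
        ("iadd",),                       # pop both, push the sum
        ("istore", local_index[dest]),   # pop the sum into the destination
    ]

# Hypothetical register-to-slot assignment for a method using v0..v2.
slots = {"v0": 0, "v1": 1, "v2": 2}
java_code = translate_add_int("v0", "v1", "v2", slots)
```

Because every operand must pass through the operand stack, the single three-address Dalvik instruction becomes four Java instructions.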


3.2 Optimization and Decompilation 


At this stage, the retargeted .class files can be de- 
compiled using existing tools, e.g., Fernflower [1] or 
Soot [45]. However, ded’s bytecode translation process 
yields unoptimized Java code. For example, Java tools 
often optimize out unnecessary assignments to the local 
variable table, e.g., unneeded return values. Without op- 
timization, decompiled code is complex and frustrates 
analysis. Furthermore, artifacts of the retargeting pro- 
cess can lead to decompilation errors in some decompil- 
ers. The need for bytecode optimization is easily demon- 
strated by considering decompiled loops. Most decom- 
pilers convert for loops into infinite loops with break 
instructions. While the resulting source code is func- 
tionally equivalent to the original, it is significantly more 
difficult to understand and analyze, especially for nested 
loops. Thus, we use Soot as a post-retargeting optimizer. 
While Soot is centrally an optimization tool with the abil- 
ity to recover source code in most cases, it does not pro- 
cess certain legal program idioms (bytecode structures) 
generated by ded. In particular, we encountered two
central problems involving 1) interactions between
synchronized blocks and exception handling, and 2) complex
control flows caused by break statements. While the
Java bytecode generated by ded is legal, the source code 
failure rate reported in the following section is almost en- 
tirely due to Soot’s inability to extract source code from 
these two cases. We will consider other decompilers in 
future work, e.g., Jad [4], JD [3], and Fernflower [1]. 


3.3 Source Code Recovery Validation


We have performed extensive validation testing of 
ded [35]. The included tests recovered the source code 
for small, medium and large open source applications 
and found no errors in recovery. In most cases the recov- 
ered code was virtually indistinguishable from the origi- 
nal source (modulo comments and method local-variable 
names, which are not included in the bytecode). 




Table 1: Studied Applications (from Android Market)

                   Total     Retargeted   Decompiled
Category           Classes   Classes      Classes      LOC
Comics               5627    99.54%       94.72%         415625
Communication       23000    99.12%       92.32%        1832514
Demo                 8012    99.90%       94.75%         830471
Entertainment       10300    99.64%       95.39%         709915
Finance             18375    99.34%       94.29%        1556392
Games (Arcade)       8508    99.27%       93.16%         766045
Games (Puzzle)       9809    99.38%       94.58%         727642
Games (Casino)      10754    99.39%       93.38%         985423
Games (Casual)       8047    99.33%       93.69%         681429
Health              11438    99.55%       94.69%         847511
Lifestyle            9548    99.69%       95.30%         778446
Multimedia          15539    99.20%       93.46%        1323805
News/Weather        14297    99.41%       94.52%        1123674
Productivity        14751    99.25%       94.87%        1443600
Reference           10596    99.69%       94.87%         887794
Shopping            15771    99.64%       96.25%        1371351
Social              23188    99.57%       95.23%        2048177
Libraries            2748    99.45%       94.18%         182655
Sports               8509    99.49%       94.44%         651881
Themes               4806    99.04%       93.30%         310203
Tools                9696    99.28%       95.29%         839866
Travel              18791    99.30%       94.47%        1419783
Total              262110    99.41%       94.41%       21734202

We also used ded to recover the source code for the 
top 50 free applications (as listed by the Android Market) 
from each of the 22 application categories—1,100 in to- 
tal. The application images were obtained from the mar- 
ket using a custom retrieval tool on September 1, 2010. 
Table 1 lists decompilation statistics. The decompilation
of all 1,100 applications took 497.7 hours (about 20.7 
days) of compute time. Soot dominated the processing 
time: 99.97% of the total time was devoted to Soot opti- 
mization and decompilation. The decompilation process 
was able to recover over 247 thousand classes spread 
over 21.7 million lines of code. This represents about 
94% of the total classes in the applications. All
decompilation errors manifest during or after decompilation,
and thus are ignored for the study reported in the later
sections. There are two categories of failures:


Retargeting Failures. 0.59% of classes were not retar- 
geted. These errors fall into three classes: a) unresolved 
references which prevent optimization by Soot, b) type 
violations caused by Android’s dex compiler and c) ex- 
tremely rare cases in which ded produces illegal byte- 
code. Recent efforts have focused on improving opti- 
mization, as well as redesigning ded with a formally de- 
fined type inference apparatus. Parallel work on improv- 
ing ded has been able to reduce these errors by a third, 
and we expect further improvements in the near future. 


Decompilation Failures. 5% of the classes were
successfully retargeted, but Soot failed to recover the source
code. Here we are limited by the state of the art in
decompilation. In order to understand the impact of
decompiling ded retargeted classes versus ordinary Java
.class files, we performed a parallel study to evaluate 
Soot on Java applications generated with traditional Java 
compilers. Of 31,553 classes from a variety of packages, 
Soot was able to decompile 94.59%, indicating we can- 
not do better while using Soot for decompilation. 

A possible way to improve this is to use a different de- 
compiler. Since our study, Fernflower [1] was available 
for a short period as part of a beta test. We decompiled 
the same 1,100 optimized applications using Fernflower 
and had a recovery rate of 98.04% of the 1.65 million 
retargeted methods—a significant improvement. Future 
studies will investigate the fidelity of Fernflower’s output 
and its appropriateness as input for program analysis. 
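
The headline Soot figures reported in this section are mutually consistent, as a quick arithmetic check shows. The variable names below are ours; the input values are the totals stated in the text and in the "Total" row of Table 1.

```python
total_classes = 262110        # Table 1, "Total" row
retargeted_rate = 0.9941      # share of all classes that were retargeted
decompiled_rate = 0.9441      # share of all classes that were decompiled
compute_hours = 497.7

decompiled_classes = total_classes * decompiled_rate
compute_days = compute_hours / 24
# Classes that were retargeted but not decompiled, as a share of all classes.
soot_failure_rate = retargeted_rate - decompiled_rate
retarget_failure_rate = 1 - retargeted_rate
```

The results agree with the stated values: roughly 247 thousand decompiled classes, about 20.7 days of compute time, a 0.59% retargeting failure rate, and a 5% decompilation failure rate.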


4 Evaluating Android Security 


Our Android application study consisted of a broad range 
of tests focused on three kinds of analysis: a) exploring 
issues uncovered in previous studies and malware advi- 
sories, b) searching for general coding security failures, 
and c) exploring misuse/security failures in the use of 
the Android framework. The following discusses the
process of identifying and encoding the tests.


4.1 Analysis Specification 


We used four approaches to evaluate recovered source 
code: control flow analysis, data flow analysis, struc- 
tural analysis, and semantic analysis. Unless otherwise 
specified, all tests used the Fortify SCA [2] static anal- 
ysis suite, which provides these four types of analysis. 
The following discusses the general application of these 
approaches. The details for our analysis specifications 
can be found in the technical report [15]. 


Control flow analysis. Control flow analysis imposes
constraints on the sequences of actions executed by an 
input program P, classifying some of them as errors. Es- 
sentially, a control flow rule is an automaton A whose 
input words are sequences of actions of P—i.e., the rule 
monitors executions of P. An erroneous action sequence 
is one that drives A into a predefined error state. To stat- 
ically detect violations specified by A, the program anal- 
ysis traces each control flow path in the tool’s model of 
P, synchronously “executing” A on the actions executed 
along this path. Since not all control flow paths in the 
model are feasible in concrete executions of P, false pos- 
itives are possible. False negatives are also possible in 
principle, though uncommon in practice. Figure 4 shows 
an example automaton for sending intents. Here, the error
state is reached if the intent contains data and is sent
unprotected without specifying the target component,
resulting in a potential unintended information leakage.

[Figure 4 diagram: automaton over an intent i with states
empty, has_data, targeted, and error; transitions are
labeled with the patterns below]

p1 = i.$new_class(...)
p2 = i.$new(...) | i.$new_action(...)
p3 = i.$set_class(...) | i.$set_component(...)
p4 = i.$put_extra(...)
p5 = i.$set_class(...) | i.$set_component(...)
p6 = $unprotected_send(i) | $protected_send(i, null)

Figure 4: Example control flow specification
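
A control flow rule of this kind can be modeled as a small transition table that is "executed" along each path. The sketch below is an assumption-laden illustration: the start state name `init` and the exact arrangement of transitions are our reading of the figure, and the code is unrelated to how Fortify SCA encodes rules.

```python
# Transition table: (state, pattern) -> next state. "init" is a hypothetical
# name for the start state; the layout is an assumed reading of Figure 4.
TRANSITIONS = {
    ("init", "p1"): "targeted",
    ("init", "p2"): "empty",
    ("empty", "p3"): "targeted",
    ("empty", "p4"): "has_data",
    ("has_data", "p5"): "targeted",
    ("has_data", "p6"): "error",
}

def run_rule(actions):
    """Execute the automaton over one path's action sequence."""
    state = "init"
    for action in actions:
        # Actions that match no outgoing pattern leave the state unchanged.
        state = TRANSITIONS.get((state, action), state)
        if state == "error":
            break
    return state
```

For example, `run_rule(["p2", "p4", "p6"])` ends in the error state (an untargeted intent carrying data is sent unprotected), whereas `run_rule(["p2", "p4", "p5", "p6"])` does not, because the target is set before the send.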


Data flow analysis. Data flow analysis permits the 
declarative specification of problematic data flows in the 
input program. For example, an Android phone contains 
several pieces of private information that should never 
leave the phone: the user’s phone number, IMEI (device 
ID), IMSI (subscriber ID), and ICC-ID (SIM card serial 
number). In our study, we wanted to check that this infor- 
mation is not leaked to the network. While this property 
can in principle be coded using automata, data flow spec- 
ification allows for a much easier encoding. The specifi- 
cation declaratively labels program statements matching 
certain syntactic patterns as data flow sources and sinks. 
Data flows between the sources and sinks are violations. 
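
The source/sink idea can be illustrated with a toy taint propagation over straight-line code. This is a deliberately simplified sketch: real analyzers handle branches, aliasing, and method calls. `getDeviceId` and `getLine1Number` are Android TelephonyManager methods, while `httpGet`, `concat`, and the statement encoding are hypothetical.

```python
SOURCES = {"getDeviceId", "getLine1Number"}   # illustrative source calls
SINKS = {"httpGet"}                            # hypothetical sink name

def find_flows(statements):
    """statements: list of (target, call, args) in straight-line order."""
    tainted, violations = set(), []
    for target, call, args in statements:
        carries = call in SOURCES or any(a in tainted for a in args)
        if call in SINKS and any(a in tainted for a in args):
            violations.append((call, args))
        if target is not None:
            if carries:
                tainted.add(target)
            else:
                tainted.discard(target)   # overwritten with clean data
    return violations

program = [
    ("imei", "getDeviceId", []),
    ("url", "concat", ["base", "imei"]),
    (None, "httpGet", ["url"]),
    (None, "httpGet", ["base"]),
]
flows = find_flows(program)
```

Only the first `httpGet` call is reported, because its argument was derived from a source; the second call sends untainted data.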


Structural analysis. Structural analysis allows for 
declarative pattern matching on the abstract syntax of 
the input source code. Structural analysis specifications 
are not concerned with program executions or data flow;
therefore, analysis is local and straightforward. For
example, in our study, we wanted to specify a bug pattern
where an Android application mines the device ID of the
phone on which it runs. This pattern was defined using
a structural rule that stated that the input program called
a method getDeviceId() whose enclosing class was
android.telephony.TelephonyManager.
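
A structural rule is essentially a predicate over syntax tree nodes. The following sketch assumes a toy `Call` node type of our own invention; it is not the rule language of the analysis tool.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Call:
    """A method-call node in a toy abstract syntax tree."""
    receiver_class: str
    method: str

def matches_device_id_rule(node):
    """True when the call is getDeviceId() on TelephonyManager."""
    return (node.receiver_class == "android.telephony.TelephonyManager"
            and node.method == "getDeviceId")

calls = [
    Call("android.telephony.TelephonyManager", "getDeviceId"),
    Call("com.example.Settings", "getDeviceId"),
    Call("android.telephony.TelephonyManager", "getNetworkType"),
]
hits = [c for c in calls if matches_device_id_rule(c)]
```

The rule inspects each node in isolation, which is why this kind of analysis needs no model of execution paths.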


Semantic analysis. Semantic analysis allows the specifi- 
cation of a limited set of constraints on the values used by 
the input program. For example, a property of interest in 
our study was that an Android application does not send 
SMS messages to hard-coded targets. To express this 
property, we defined a pattern matching calls to Android 
messaging methods such as sendTextMessage(). Seman- 
tic specifications permit us to directly specify that the 
first parameter in these calls (the phone number) is not 
a constant. The analyzer detects violations to this prop- 
erty using constant propagation techniques well known 
in program analysis literature. 
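
The constant propagation idea can be sketched over a list of assignments and calls. The statement encoding and function name are hypothetical, and the sketch handles only straight-line code, so it illustrates the technique rather than the analyzer's actual behavior.

```python
def constant_sms_targets(statements):
    """Return destinations of sendTextMessage calls that are provably constant.

    statements: ("assign", var, value), where value is a string literal
    written as ("lit", text) or another variable name, or
    ("call", "sendTextMessage", first_argument).
    """
    constants, flagged = {}, []

    def resolve(value):
        if isinstance(value, tuple) and value[0] == "lit":
            return value[1]
        return constants.get(value)          # None if not a known constant

    for stmt in statements:
        if stmt[0] == "assign":
            _, var, value = stmt
            known = resolve(value)
            if known is None:
                constants.pop(var, None)     # value is no longer constant
            else:
                constants[var] = known
        elif stmt[0] == "call" and stmt[1] == "sendTextMessage":
            known = resolve(stmt[2])
            if known is not None:
                flagged.append(known)
    return flagged

program = [
    ("assign", "premium", ("lit", "5550100")),
    ("assign", "dest", "premium"),
    ("call", "sendTextMessage", "dest"),
    ("assign", "dest", "user_input"),
    ("call", "sendTextMessage", "dest"),
]
flagged = constant_sms_targets(program)
```

The first call is flagged because its destination traces back to a literal; the second is not, because the destination was reassigned from a non-constant value.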


4.2 Analysis Overview 


Our analysis covers both dangerous functionality and 
vulnerabilities. Selecting the properties for study was a 
significant challenge. For brevity, we only provide an 
overview of the specifications. The technical report [15] 
provides a detailed discussion of specifications.


Misuse of Phone Identifiers (Section 5.1.1). Previous
studies [14, 12] identified phone identifiers leaking to re- 
mote network servers. We seek to identify not only the 
existence of data flows, but understand why they occur. 


Exposure of Physical Location (Section 5.1.2). Previous 
studies [14] identified location exposure to advertisement 
servers. Many applications provide valuable location- 
aware utility, which may be desired by the user. By man- 
ually inspecting code, we seek to identify the portion of 
the application responsible for the exposure. 


Abuse of Telephony Services (Section 5.2.1). Smartphone
malware has sent SMS messages to premium-rate
numbers. We study the use of hard-coded phone num- 
bers to identify SMS and voice call abuse. 


Eavesdropping on Audio/Video (Section 5.2.2). Audio 
and video eavesdropping is a commonly discussed smart- 
phone threat [41]. We examine cases where applications 
record audio or video without control flows to UI code. 


Botnet Characteristics (Sockets) (Section 5.2.3). PC 
botnet clients historically use non-HTTP ports and
protocols for command and control. Most applications use
HTTP client wrappers for network connections; therefore,
we examine Socket use for suspicious behavior.


Harvesting Installed Applications (Section 5.2.4). The 
list of installed applications is a valuable demographic 
for marketing. We survey the use of APIs to retrieve this 
list to identify harvesting of installed applications. 


Use of Advertisement Libraries (Section 5.3.1). Pre- 
vious studies [14, 12] identified information exposure to 
ad and analytics networks. We survey inclusion of ad and 
analytics libraries and the information they access. 


Dangerous Developer Libraries (Section 5.3.2). During 
our manual source code inspection, we observed danger- 
ous functionality replicated between applications. We re- 
port on this replication and the implications. 


Android-specific Vulnerabilities (Section 5.4). We 
search for non-secure coding practices [17, 10], includ- 
ing: writing sensitive information to logs, unprotected 
broadcasts of information, IPC null checks, injection at- 
tacks on intent actions, and delegation. 


General Java Application Vulnerabilities. We look for 
general Java application vulnerabilities, including mis- 
use of passwords, misuse of cryptography, and tradi- 
tional injection vulnerabilities. Due to space limitations, 
individual results for the general vulnerability analysis 
are reported in the technical report [15]. 


5 Application Analysis Results 


In this section, we document the program analysis results 
and manual inspection of identified violations.

Table 2: Access of Phone Identifier APIs

Identifier     # Calls   # Apps   # w/ Permission*
Phone Number       167      129                105
IMEI               378      216               184†
IMSI                38       30                 27
ICC-ID              33       21                 21
Total Unique         -      246               210†

* Defined as having the READ_PHONE_STATE permission.
† Only 1 app did not also have the INTERNET permission.


5.1 Information Misuse 


In this section, we explore how sensitive information is 
being leaked [12, 14] through information sinks includ- 
ing OutputStream objects retrieved from URLConnec- 
tions, HTTP GET and POST parameters in HttpClient 
connections, and the string used for URL objects. Future 
work may also include SMS as a sink. 


5.1.1 Phone Identifiers 


We studied four phone identifiers: phone number, IMEI 
(device identifier), IMSI (subscriber identifier), and ICC- 
ID (SIM card serial number). We performed two types of 
analysis: a) we scanned for APIs that access identifiers, 
and b) we used data flow analysis to identify code capa- 
ble of sending the identifiers to the network. 

Table 2 summarizes API calls that receive phone
identifiers. In total, 246 applications (22.4%) included 
code to obtain a phone identifier; however, only 210 of 
these applications have the READ_PHONE_STATE permis- 
sion required to obtain access. Section 5.3 discusses code 
that probes for permissions. We observe from Table 2 
that applications most frequently access the IMEI (216 
applications, 19.6%). The phone number is used second 
most (129 applications, 11.7%). Finally, the IMSI and 
ICC-ID are very rarely used (less than 3%). 
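
The percentages quoted in this paragraph follow directly from the 1,100 studied applications, as the short check below shows. The helper name `share` is ours; the application counts are those given in Table 2.

```python
STUDIED_APPS = 1100

def share(apps):
    """Percentage of the studied applications, rounded to one decimal."""
    return round(100 * apps / STUDIED_APPS, 1)

shares = {
    "any identifier": share(246),
    "IMEI": share(216),
    "phone number": share(129),
    "IMSI": share(30),
    "ICC-ID": share(21),
}
```

Both the IMSI and ICC-ID shares fall below 3%, matching the "very rarely used" characterization.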

Table 3 indicates the data flows that exfiltrate phone
identifiers. The 33 applications have the INTERNET
permission, but 1 application does not have the
READ_PHONE_STATE permission. We found data flows for all
four identifier types: 25 applications have IMEI data 
flows; 10 applications have phone number data flows; 
5 applications have IMSI data flows; and 4 applications 
have ICC-ID data flows. 

To gain a better understanding of how phone identi- 
fiers are used, we manually inspected all 33 identified ap- 
plications, as well as several additional applications that 
contain calls to identifier APIs. We confirmed exfiltration 
for all but one application. In this case, code complexity 
hindered manual confirmation; however, we identified a
different data flow not found by program analysis. The 
analysis informs the following findings. 


Table 3: Detected Data Flows to Network Sinks

                    Phone Identifiers     Location Info.
Sink                # Flows    # Apps     # Flows    # Apps
OutputStream             10         9           0         0
HttpClient Param         24         9          12         4
URL Object               59        19          49        10
Total Unique              -        33           -        13


Finding 1 - Phone identifiers are frequently leaked
through plaintext requests. Most sinks are HTTP
GET or POST parameters. HTTP parameter names
for the IMEI include: “uid,” “user-id,” “imei,”
“deviceId,” “deviceSerialNumber,” “devicePrint,” “X-DSN,”
and “uniquely_code”; phone number names include
“phone” and “mdn”; and IMSI names include “did” and
“imsi.” In one case we identified an HTTP parameter for
the ICC-ID, but the developer mislabeled it “imei.”
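
The parameter names listed in Finding 1 suggest a simple way to spot candidate identifier leaks in captured request URLs. The sketch below is our own illustration, not part of the study's tooling; the example URL and its values are made up, and a name match alone does not prove that the value is an identifier.

```python
from urllib.parse import urlsplit, parse_qs

# Parameter names reported in Finding 1, lower-cased for comparison.
IDENTIFIER_PARAMS = {
    "uid", "user-id", "imei", "deviceid", "deviceserialnumber",
    "deviceprint", "x-dsn", "uniquely_code", "phone", "mdn", "did", "imsi",
}

def identifier_params(url):
    """Return the query parameter names in url that match the list above."""
    query = parse_qs(urlsplit(url).query, keep_blank_values=True)
    return sorted(name for name in query if name.lower() in IDENTIFIER_PARAMS)

# Hypothetical request URL with a made-up IMEI value.
example = "http://ads.example.com/get?deviceId=000000000000000&q=shoes&mdn="
matches = identifier_params(example)
```

As the mislabeled ICC-ID case shows, parameter names are only a hint, so any match still needs manual confirmation.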


Finding 2 - Phone identifiers are used as device fin- 
gerprints. Several data flows directed us towards code 
that reports not only phone identifiers, but also other 
phone properties to a remote server. For example, a wall- 
paper application (com.eoeandroid.eWallpapers.cartoon) 
contains a class named SyncDeviceInfosService that
collects the IMEI and attributes such as the OS version
and device hardware. The method sendDeviceInfos()
sends this information to a server. In another
application (com.avantar.wny), the method
PhoneStats.toUrlFormatedString() creates a URL parameter
string containing the IMEI, device model, platform, and 
application name. While the intent is not clear, such fin- 
gerprinting indicates that phone identifiers are used for 
more than a unique identifier. 


Finding 3 - Phone identifiers, specifically the IMEI, 
are used to track individual users. Several 
applications contain code that binds the IMEI as 
a unique identifier to network requests. For ex- 
ample, some applications (e.g. com.Qunar and 
com.nextmobileweb.craigsphone) appear to bundle the 
IMEI in search queries; in a travel application 
(com.visualit.tubeLondonCity), the method
refreshLiveInfo() includes the IMEI in a URL; and a “keyring”
application (com.froogloid.kring.google.zxing.client.android)
appends the IMEI to a variable named
retailerLookupCmd. We also found functionality that
includes the IMEI when checking for updates (e.g.,
com.webascender.callerid, which also includes the 
phone number) and retrieving advertisements (see Find- 
ing 6). Furthermore, we found two applications 
(com.taobo.tao and raker.duobao.store) with network ac- 
cess wrapper methods that include the IMEI for all con- 
nections. These behaviors indicate that the IMEI is used 
as a form of “tracking cookie”. 

Finding 4 - The IMEI is tied to personally identifi- 
able information (PII). The common belief that the 
IMEI to phone owner mapping is not visible outside 
the cellular network is no longer true. In several 
cases, we found code that bound the IMEI to account
information and other PII. For example, applications
(e.g., com.slacker.radio and com.statefarm.pocketagent)
include the IMEI in account registration and login re- 
quests. In another application (com.amazon.mp3), the 
method linkDevice() includes the IMEI. Code inspec- 
tion indicated that this method is called when the user 
chooses to “Enter a claim code” to redeem gift cards. 
We also found IMEI use in code for sending comments 
and reporting problems (e.g., com.morbe.guarder and 
com.fm207.discount). Finally, we found one application 
(com.andoop.highscore) that appears to bundle the IMEI 
when submitting high scores for games. Thus, it seems 
clear that databases containing mappings between phys- 
ical users and IMEIs are being created. 


Finding 5 - Not all phone identifier use leads to exfiltra- 
tion. Several applications that access phone identifiers 
did not exfiltrate the values. For example, one applica- 
tion (com.amazon.kindle) creates a device fingerprint for 
a verification check. The fingerprint is kept in “secure 
storage” and does not appear to leave the phone. An- 
other application (com.match.android.matchmobile) as- 
signs the phone number to a text field used for account 
registration. While the value is sent to the network dur- 
ing registration, the user can easily change or remove it. 


Finding 6 - Phone identifiers are sent to advertise- 
ment and analytics servers. Many applications have 
custom ad and analytics functionality. For example, 
in one application (com.accuweather.android), the class 
ACCUWX_AdRequest is an IMEI data flow sink. Another
application (com.amazon.mp3) defines Android service 
component AndroidMetricsManager, which is an IMEI 
data flow sink. Phone identifier data flows also occur 
in ad libraries. For example, we found a phone num- 
ber data flow sink in the com/wooboo/adlib_android 
library used by several applications (e.g., cn.ecook, 
com.superdroid.sqd, and com.superdroid.ewc). Sec- 
tion 5.3 discusses ad libraries in more detail. 


5.1.2 Location Information 


Location information is accessed in two ways: (1) calling 
getLastKnownLocation(), and (2) defining callbacks in 
a LocationListener object passed to requestLocationUp- 
dates(). Due to code recovery failures, not all Location- 
Listener objects have corresponding requestLocationUp- 
dates() calls. We scanned for all three constructs. 

Table 4 summarizes the access of location information.
In total, 505 applications (45.9%) attempt to access
location, but only 304 (27.6%) have the permission to do so.
This difference is likely due to libraries that probe for
permissions, as discussed in Section 5.3. The separation
between LocationListener and requestLocationUpdates()
is primarily due to the AdMob library, which
defined the former but has no calls to the latter.

Table 4: Access of Location APIs

Identifier               # Uses   # Apps   # w/ Perm.*
getLastKnownLocation        428      204           148
LocationListener            652      469           282
requestLocationUpdates      316      146           128
Total Unique                  -      505          304†

* Defined as having a LOCATION permission.
† In total, 5 apps did not also have the INTERNET permission.


Table 3 shows detected location data flows to the
network. To overcome missing code challenges, the data
flow source was defined as the getLatitude() and getLon- 
gitude() methods of the Location object retrieved from 
the location APIs. We manually inspected the 13 appli- 
cations with location data flows. Many data flows ap- 
peared to reflect legitimate uses of location for weather, 
classifieds, points of interest, and social networking ser- 
vices. Inspection of the remaining applications informs 
the following findings:


Finding 7 - The granularity of location reporting may
not always be obvious to the user. In one applica- 
tion (com.andoop.highscore) both the city/country and 
geographic coordinates are sent along with high scores. 
Users may be aware of regional geographic information 
associated with scores, but it was unclear if users are 
aware that precise coordinates are also used. 


Finding 8 - Location information is sent to advertise- 
ment servers. Several location data flows appeared to 
terminate in network connections used to retrieve ads. 
For example, two applications (com.avantar.wny and 
com.avantar.yp) appended the location to the variable 
webAdURLString. Motivated by [14], we inspected the 
AdMob library to determine why no data flow was found 
and determined that source code recovery failures led to 
the false negatives. Section 5.3 expands on ad libraries. 


5.2 Phone Misuse 


This section explores misuse of the smartphone inter- 
faces, including telephony services, background record- 
ing of audio and video, sockets, and accessing the list of 
installed applications. 


5.2.1 Telephony Services 


Smartphone malware can provide direct compensation 
using phone calls or SMS messages to premium-rate 
numbers [18, 25]. We defined three queries to identify 
such malicious behavior: (1) a constant used for the SMS 
destination number; (2) creation of URI objects with a 
“tel:” prefix (used for phone call intent messages) and 
the string “900” (a premium-rate number prefix in the 
US); and (3) any URI objects with a “tel:” prefix. The 
analysis informs the following findings. 


Finding 9 - Applications do not appear to be using fixed 
phone number services. We found zero applications
using a constant destination number for the SMS API.
Note that our analysis specification is limited to constants 
passed directly to the API and final variables, and there- 
fore may have false negatives. We found two applica- 
tions creating URI objects with the “tel:” prefix and 
containing the string “900”. One application included
code to call “tel://0900-9292”, which is a premium-rate
number (€0.70 per minute) for travel advice in the
Netherlands. However, this did not appear malicious, as
the application (com.Planner9292) is designed to provide
travel advice. The other application contained several 
hard-coded numbers with “900” in the last four digits 
of the number. The SMS and premium-rate analysis re- 
sults are promising indicators for non-existence of ma- 
licious behavior. Future analysis should consider more 
premium-rate prefixes. 
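
The difference between a true premium-rate prefix and a number that merely contains "900" can be made explicit in a refined check. The following sketch is our own assumption about how such a refinement might look; in particular, stripping a single leading "0" or "1" is a crude heuristic for trunk prefixes and country codes, and the default prefix list is illustrative only.

```python
def classify_tel_uri(uri, premium_prefixes=("900",)):
    """Classify a URI as 'not-tel', 'premium', 'contains-900', or 'other'."""
    if not uri.startswith("tel:"):
        return "not-tel"
    digits = "".join(ch for ch in uri[4:] if ch.isdigit())
    # Heuristic: drop one leading trunk prefix or US country code digit.
    national = digits[1:] if digits[:1] in ("0", "1") else digits
    if national.startswith(tuple(premium_prefixes)):
        return "premium"
    return "contains-900" if "900" in digits else "other"
```

Under these assumptions, `classify_tel_uri("tel://0900-9292")` yields `"premium"`, while a number such as `"tel:12025559001"` is reported only as `"contains-900"`. Supporting more premium-rate prefixes amounts to extending `premium_prefixes`.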


Finding 10 - Applications do not appear to be misusing
voice services. We found 468 URI objects with
the “tel:” prefix in 358 applications. We manually 
inspected a sample of applications to better understand 
phone number use. We found: (1) applications fre- 
quently include call functionality for customer service; 
(2) the “CALL” and “DIAL” intent actions were used 
equally for the same purpose (CALL calls immediately 
and requires the CALL_PHONE permission, whereas DIAL 
has user confirmation in the dialer and requires no
permission); and (3) not all hard-coded telephone numbers are
used to make phone calls, e.g., the AdMob library had an
apparently unused phone number hard coded.


5.2.2 Background Audio/Video 


Microphone and camera eavesdropping on smartphones 
is a real concern [41]. We analyzed application eaves- 
dropping behaviors, specifically: (1) recording video 
without calling setPreviewDisplay() (this API is always 
required for still image capture); (2) AudioRecord.read()
in code not reachable from an Android activity
component; and (3) MediaRecorder.start() in code not
reachable from an activity component.
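
Queries (2) and (3) are reachability questions over a call graph. A minimal sketch, assuming a hypothetical caller-to-callee mapping and made-up method names, is a breadth-first search from the activity entry points:

```python
from collections import deque

def reachable_from(call_graph, roots):
    """Breadth-first search over a caller -> callees mapping."""
    seen, queue = set(roots), deque(roots)
    while queue:
        for callee in call_graph.get(queue.popleft(), ()):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen

# Hypothetical call graph; only MainActivity.onCreate is an activity entry point.
call_graph = {
    "MainActivity.onCreate": ["VoiceSearch.begin"],
    "VoiceSearch.begin": ["AudioRecord.read"],
    "BootReceiver.onReceive": ["Spy.capture"],
    "Spy.capture": ["MediaRecorder.start"],
}
reachable = reachable_from(call_graph, ["MainActivity.onCreate"])
suspicious = [api for api in ("AudioRecord.read", "MediaRecorder.start")
              if api not in reachable]
```

Here only `MediaRecorder.start` is flagged. If call edges are missing from the graph, for instance because some code was not recovered, legitimate recording code can be flagged in the same way.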


Finding 11 - Applications do not appear to be misusing 
video recording. We found no applications that record 
video without calling setPreviewDisplay(). The query 
reasonably did not consider the value passed to the pre- 
view display, and therefore may create false negatives. 
For example, the “preview display” might be one pixel 
in size. The MediaRecorder.start() query detects audio 
recording, but it also detects video recording. This query 
found two applications using video in code not reachable 
from an activity; however the classes extended Surface- 
View, which is used by setPreviewDisplay(). 


Finding 12 - Applications do not appear to be misusing
audio recording. We found eight uses in seven
applications of AudioRecord.read() without a control flow
path to an activity component. Of these applications,
three provide VoIP functionality, two are games that re- 
peat what the user says, and one provides voice search. 
In these applications, audio recording is expected; the 
lack of reachability was likely due to code recovery fail- 
ures. The remaining application did not have the required 
RECORD_AUDIO permission and the code most likely was 
part of a developer toolkit. The MediaRecorder.start()
query identified an additional five applications recording 
audio without reachability to an activity. Three of these 
applications have legitimate reasons to record audio: 
voice search, game interaction, and VoIP. Finally, two 
games included audio recording in a developer toolkit, 
but no record permission, which explains the lack of 
reachability. Section 5.3.2 discusses developer toolkits. 


5.2.3 Socket API Use 


Java sockets represent an open interface to external ser- 
vices, and thus are a potential source of malicious be- 
havior. For example, smartphone-based botnets have 
been found to exist on “jailbroken” iPhones [8]. We ob- 
serve that most Internet-based smartphone applications 
are HTTP clients. Android includes useful classes (e.g., 
HttpURLConnection and HttpClient) for communicating 
with Web servers. Therefore, we queried for applications 
that make network connections using the Socket class. 


Finding 13 - A small number of applications include 
code that uses the Socket class directly. We found 
177 Socket connections in 75 applications (6.8%). Many 
applications are flagged for inclusion of well-known 
network libraries such as org/apache/thrift, org/ 
apache/commons, and org/eclipse/jetty, which 
use sockets directly. Socket factories were also detected. 
Identified factory names such as TrustAllSSLSocket- 
Factory, AllTrustSSLSocketFactory, and NonValidat- 
ingSSLSocketFactory are interesting as potential vulnera- 
bilities, but we found no evidence of malicious use. Sev- 
eral applications also included their own HTTP wrapper 
methods that duplicate functionality in the Android li- 
braries, but did not appear malicious. Among the appli- 
cations including custom network connection wrappers 
is a group of applications in the “Finance” category im- 
plementing cryptographic network protocols (e.g., in the 
com/lumensoft/ks library). We note that these appli- 
cations use Asian character sets for their market descrip- 
tions, and we could not determine their exact purpose. 


Finding 14 - We found no evidence of malicious
behavior by applications using Socket directly. We
manually inspected all 75 applications to determine if Socket
use seemed appropriate based on the application
description. Our survey yielded a diverse array of Socket uses,
including: file transfer protocols, chat protocols,
audio and video streaming, and network connection
tethering, among other uses excluded for brevity. A handful
of applications have socket connections to hard-coded
IP addresses and non-standard ports. For example, one
application (com.eingrad.vintagecomicdroid) downloads 
comics from 208.94.242.218 on port 2009. Addition- 
ally, two of the aforementioned financial applications 
(com.miraeasset.mstock and kvp.jjy.MispAndroid320) 
include the kr/co/shiftworks library that connects to 
221.143.48.118 on port 9001. Furthermore, one
application (com.tf1.lci) connects to 209.85.227.147 on port 80
in a class named AdService and subsequently calls
getLocalAddress() to retrieve the phone’s IP address. Overall,
we found no evidence of malicious behavior, but several 
applications warrant deeper investigation. 
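
The two signals mentioned in this finding, hard-coded IP addresses and non-standard ports, are easy to check mechanically once socket endpoints have been extracted. The sketch below is illustrative only; the function name and the choice of ports 80 and 443 as "standard" are our assumptions.

```python
import ipaddress

STANDARD_PORTS = {80, 443}   # assumption: treat plain HTTP/HTTPS as standard

def flag_endpoint(host, port):
    """Return the reasons an endpoint deserves a closer look."""
    reasons = []
    try:
        ipaddress.ip_address(host)
        reasons.append("hard-coded IP")
    except ValueError:
        pass                  # host is a name, not an address literal
    if port not in STANDARD_PORTS:
        reasons.append("non-standard port")
    return reasons
```

For instance, `flag_endpoint("208.94.242.218", 2009)` reports both reasons, while `flag_endpoint("209.85.227.147", 80)` reports only the hard-coded address. Neither signal indicates malice by itself; they only prioritize manual review.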


5.2.4 Installed Applications 


The list of installed applications provides valuable mar-
keting data. Android has two relevant API types: (1)
a set of get APIs returning the list of installed applica- 
tions or package names; and (2) a set of query APIs that 
mirrors Android’s runtime intent resolution, but can be 
made generic. We found 54 uses of the get APIs in 45 
applications, and 1015 uses of the query APIs in 361 ap- 
plications. Sampling these applications, we observe: 


Finding 15 - Applications do not appear to be har- 
vesting information about which applications are in- 
stalled on the phone. In all but two cases, 
the sampled applications using the get APIs search 
the results for a specific application. One applica-
tion (com.davidgoemans.simpleClockWidget) defines a
method that returns the list of all installed applications, 
but the results were only displayed to the user. The 
second application (raker.duobao.store) defines a simi- 
lar method, but it only appears to be called by unused 
debugging code. Our survey of the query APIs identi- 
fied three calls within the AdMob library duplicated in 
many applications. These uses queried specific function- 
ality and thus are not likely to harvest application infor- 
mation. The one non-AdMob application we inspected 
queried for specific functionality, e.g., speech recogni- 
tion, and thus did not appear to attempt harvesting. 


5.3 Included Libraries 


Libraries included by applications are often easy to iden- 
tify due to namespace conventions: i.e., the source 
code for com.foo.appname typically exists in com/foo/ 
appname. During our manual inspection, we docu- 
mented advertisement and analytics library paths. We 
also found applications sharing what we term “developer 
toolkits,” i.e., a common set of developer utilities. 


5.3.1 Advertisement and Analytics Libraries 


We identified 22 library paths containing ad or analytics
functionality. Sampled applications frequently contained
multiple of these libraries. Using the paths listed in Ta-
ble 5, we found: 1 app has 8 libraries; 10 apps have 7 li-
braries; 8 apps have 6 libraries; 15 apps have 5 libraries;
37 apps have 4 libraries; 32 apps have 3 libraries; 91 apps
have 2 libraries; and 367 apps have 1 library.

Table 5: Identified Ad and Analytics Library Paths

Library Path                        # Apps  Format  Obtains*
com/admob/android/ads                  320  Obf.    L
com/google/ads                         206  Plain   -
com/flurry/android                      98  Obf.    -
com/qwapi/adclient/android              74  Plain   L, P, E
com/google/android/apps/analytics       67  Plain   -
com/adwhirl                             60  Plain   L
com/mobclix/android/sdk                 58  Plain   L, E‡
com/millennialmedia/android             52  Plain   -
com/zestadz/android                     10  Plain   -
com/admarvel/android/ads                 8  Plain   -
com/estsoft/adlocal                      8  Plain   L
com/adfonic/android                      5  Obf.    -
com/vdroid/ads                           5  Obf.    L, E
com/greystripe/android/sdk               4  Obf.    E
com/medialets                            4  Obf.    L
com/wooboo/adlib_android                 4  Obf.    L, P, I†
com/adserver/adview                      3  Obf.    L
com/tapjoy                               3  Plain   -
com/inmobi/androidsdk                    2  Plain   E‡
com/apegroup/ad                          1  Plain   -
com/casee/adsdk                          1  Plain   S
com/webtrends/mobile                     1  Plain   L, E, S, I
Total Unique Apps                      561  -       -

* L = Location; P = Phone number; E = IMEI; S = IMSI; I = ICC-ID
† In 1 app, the library included “L”, while the other 3 included “P, I”.
‡ Direct API use not decompiled, but wrapper .getDeviceId() called.

Table 5 shows advertisement and analytics library use. 
In total, at least 561 applications (51%) include these 
libraries; however, additional libraries may exist, and 
some applications include custom ad and analytics func- 
tionality. The AdMob library is used most pervasively, 
existing in 320 applications (29.1%). Google Ads is used 
by 206 applications (18.7%). We observe from Table 5 
that only a handful of libraries are used pervasively. 
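
The counts quoted above can be cross-checked mechanically. The following is a minimal sketch that only re-adds the figures stated in the text; the names `apps_by_library_count`, `STUDIED_APPS`, and `percent` are ours and purely illustrative.

```python
# Number of applications (value) that include a given number of
# ad/analytics libraries (key), as quoted in the text.
apps_by_library_count = {8: 1, 7: 10, 6: 8, 5: 15, 4: 37, 3: 32, 2: 91, 1: 367}

STUDIED_APPS = 1100  # size of the studied application set

# Each application with at least one library is counted exactly once.
total_apps = sum(apps_by_library_count.values())

# Total library inclusions, counting an application once per library.
total_inclusions = sum(n * apps for n, apps in apps_by_library_count.items())


def percent(count):
    """Share of the studied applications, in percent, to one decimal."""
    return round(100 * count / STUDIED_APPS, 1)


print(total_apps, total_inclusions)
print(percent(total_apps), percent(320), percent(206))
```

The tally reproduces the 561 unique applications (51%) and the 29.1% and 18.7% shares reported for AdMob and Google Ads.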

Several libraries access phone identifier and location 
APIs. Given the library purpose, it is easy to specu- 
late data flows to network APIs. However, many of 
these flows were not detected by program analysis. This 
is (likely) a result of code recovery failures and flows 
through Android IPC. For example, AdMob has known 
location to network data flows [14], and we identified 
a code recovery failure for the class implementing that 
functionality. Several libraries are also obfuscated, as 
mentioned in Section 6. Interestingly, 6 of the 13 li-
braries accessing sensitive information are obfuscated.
The analysis informs the following additional findings. 


Finding 16 - Ad and analytics library use of phone iden- 
tifiers and location is sometimes configurable. The 
com/webtrends/mobile analytics library (used by
com.statefarm.pocketagent) defines the WebtrendsId-
Method class specifying four identifier types. Only one
type, “system_id_extended”, uses phone identifiers (IMEI,
IMSI, and ICC-ID). It is unclear which identifier type 
was used by the application. Other libraries provide sim- 
ilar configuration. For example, the AdMob SDK docu- 
mentation [6] indicates that location information is only 
included if a package manifest configuration enables it.


Finding 17 - Analytics library reporting frequency is of-
ten configurable. During manual inspection, we encoun-
tered one application (com.handmark.mpp.news.reuters)
in which the phone number is passed to FlurryA- 
gent.onEvent() as generic data. This method is called 
throughout the application, specifying event labels such 
as “GetMoreStories,” “StoryClickedFromList,” and “Im-
ageZoom.” Here, we observe that the main application code
specifies not only the phone number to be reported, but
also the reporting frequency.


Finding 18 - Ad and analytics libraries probe for permis- 
sions. The com/webtrends/mobile library accesses 
the IMEI, IMSI, ICC-ID, and location. The Webtrend-
sAndroidValueFetcher class uses try/catch blocks that
catch the SecurityException that is thrown when an appli- 
cation does not have the proper permission. Similar func- 
tionality exists in the com/casee/adsdk library (used 
by com.fish.luny). In AdFetcher.getDeviceId(), An-
droid’s checkCallingOrSelfPermission() method is eval-
uated before accessing the IMSI.
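
The two probing patterns described in Finding 18 can be illustrated with a toy model. This is a sketch under stated assumptions, not code recovered from either library: nothing below is an Android API, `PermissionError` merely plays the role of Android's SecurityException, and `GRANTED`, `read_identifier`, `probe_by_exception`, and `probe_by_check` are hypothetical names.

```python
# Permissions held by the host application (assumed for illustration).
GRANTED = {"ACCESS_FINE_LOCATION"}


def read_identifier(permission, value):
    """Return `value` only if the host application holds `permission`."""
    if permission not in GRANTED:
        raise PermissionError(permission)
    return value


def probe_by_exception():
    """Pattern 1: call the API and swallow the failure (try/catch style)."""
    collected = {}
    for name, perm in [("imei", "READ_PHONE_STATE"),
                       ("location", "ACCESS_FINE_LOCATION")]:
        try:
            collected[name] = read_identifier(perm, "<" + name + ">")
        except PermissionError:
            pass  # permission missing: skip silently, no crash
    return collected


def probe_by_check():
    """Pattern 2: test for the permission before calling the API."""
    collected = {}
    if "READ_PHONE_STATE" in GRANTED:
        collected["imsi"] = read_identifier("READ_PHONE_STATE", "<imsi>")
    return collected
```

In both patterns the library collects whatever the host application's permissions happen to allow and fails silently otherwise, which is why the same library can obtain different data in different applications.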


5.3.2 Developer Toolkits 


Several inspected applications use developer toolkits 
containing common sets of utilities identifiable by class 
name or library path. We observe the following. 


Finding 19 - Some developer toolkits replicate dan- 
gerous functionality. We found three wallpaper 
applications by developer “callmejack” that include 
utilities in the library path com/jackeeywu/apps/ 
eWallpaper (com.eoeandroid.eWallpapers.cartoon, 
com.jackeey.wallpapers.alll.orange, and com.jackeey. 
eWallpapers.gundam). This library has data flow sinks 
for the phone number, IMEI, IMSI, and ICC-ID. In July 
2010, Lookout, Inc. reported a wallpaper application 
by developer “jackeey,wallpaper” as sending these 
identifiers to imnet.us [29]. This report also indicated
that the developer changed his name to “callmejack”.
While the original “jackeey,wallpaper” application was
removed from the Android Market, the applications by
“callmejack” remained as of September 2010.³


Finding 20 - Some developer toolkits probe for permis-
sions. In one application (com.july.cbssports.activity),
we found code in the com/julysystems library that
evaluates Android’s checkPermission() method for the
READ_PHONE_STATE and ACCESS_FINE_LOCATION per-
missions before accessing the IMEI, phone number, and
last known location, respectively. A second application
(v00032.com.wordplayer) defines the CustomException-
Hander class to send an exception event to an HTTP
URL. The class attempts to retrieve the phone num- 
ber within a try/catch block, catching a generic Ex- 
ception. However, the application does not have the 
READ_PHONE_STATE permission, indicating the class is 
likely used in multiple applications. 


Finding 21 - Well-known brands sometimes commis- 
sion developers that include dangerous functional- 
ity. The com/julysystems developer toolkit iden- 
tified as probing for permissions exists in two appli- 
cations with reputable application providers. “CBS 
Sports Pro Football” (com.july.cbssports.activity) is pro-
vided by “CBS Interactive, Inc.”, and “Univision Fútbol”
(com.july.univision) is provided by “Univision Interac- 
tive Media, Inc.”. Both have location and phone state 
permissions, and hence potentially misuse information. 

Similarly, “USA TODAY” (com.usatoday.android. 
news) provided by “USA TODAY” and “FOX News” 
(com.foxnews.android) provided by “FOX News Net- 
work, LLC” contain the com/mercuryintermedia 
toolkit. Both applications contain an Android ac- 
tivity component named MainActivity. In the ini- 
tialization phase, the IMEI is retrieved and passed 
to ProductConfiguration.initialize() (part of the com/
mercuryintermedia toolkit). Both applications have
IMEI to network data flows through this method. 


5.4 Android-specific Vulnerabilities 


This section explores Android-specific vulnerabilities. 
The technical report [15] provides specification details. 


5.4.1 Leaking Information to Logs 


Android provides centralized logging via the Log API, 
which can be displayed with the “logcat” command.
While logcat is a debugging tool, applications with the 
READ_LOGS permission can read these log messages. The 
Android documentation for this permission indicates that 
“[the logs] can contain slightly private information about 
what is happening on the device, but should never con- 
tain the user’s private information.” We looked for data 
flows from phone identifier and location APIs to the An- 
droid logging interface and found the following. 


Finding 22 - Private information is written to Android’s 
general logging interface. We found 253 data flows in 96 
applications for location information, and 123 flows in 
90 applications for phone identifiers. Frequently, URLs 
containing this private information are logged just before 
a network connection is made. Thus, the READ_LOGS 
permission allows access to private information. 


5.4.2 Leaking Information via IPC 


As shown in Figure 5, any application can receive intent
broadcasts that do not specify the target component or
protect the broadcast with a permission (permission vari-
ant not shown). This is unsafe if the intent contains sensi-
tive information. We found 271 such unsafe intent broad-
casts with “extras” data in 92 applications (8.4%). Sam-
pling these applications, we found several such intents
used to install shortcuts to the home screen.

[Figure 5: Eavesdropping on unprotected intents. The figure
shows application pkgname sending a partially specified intent
message (action "pkgname.intent.ACTION") and a fully specified
intent message (action "pkgname.intent.ACTION", component
"pkgname.FooReceiver"). Both pkgname.FooReceiver and
malicious.BarReceiver register the filter "pkgname.intent.ACTION".]


Finding 23 - Applications broadcast private informa-
tion in IPC accessible to all applications. We found
many cases of applications sending unsafe intents to 
action strings containing the application’s namespace 
(e.g., “pkgname.intent.ACTION” for application pkg-
name). The contents of the bundled information var-
ied. In some instances, the data was not sensitive,
e.g., widget and task identifiers. However, we also 
found sensitive information. For example, one applica-
tion (com.ulocate) broadcasts the user’s location to the
“com.ulocate.service.LOCATION” intent action string
without protection. Another application (com.himsn) 
broadcasts the instant messaging client’s status to the 
“cm.mz.stS” action string. These vulnerabilities allow 
malicious applications to eavesdrop on sensitive infor- 
mation in IPC, and in some cases, gain access to infor- 
mation that requires a permission (e.g., location). 
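
The eavesdropping scenario can be sketched as a toy model of broadcast delivery. It is a simplification for illustration only: `Receiver`, `Intent`, and `broadcast` are hypothetical stand-ins rather than Android classes, matching is reduced to string equality on the action, and the permission-protected variant is omitted.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Receiver:
    component: str      # e.g. "pkgname.FooReceiver"
    action_filter: str  # the action string this receiver subscribes to
    inbox: list = field(default_factory=list)


@dataclass
class Intent:
    action: str
    extras: dict
    component: Optional[str] = None  # None means "partially specified"


def broadcast(intent, receivers):
    """Deliver to the named component, or to every matching filter."""
    for receiver in receivers:
        if intent.component is not None:
            if receiver.component == intent.component:
                receiver.inbox.append(intent.extras)
        elif receiver.action_filter == intent.action:
            receiver.inbox.append(intent.extras)


foo = Receiver("pkgname.FooReceiver", "pkgname.intent.ACTION")
bar = Receiver("malicious.BarReceiver", "pkgname.intent.ACTION")

# Partially specified: both the intended and the malicious receiver get it.
broadcast(Intent("pkgname.intent.ACTION", {"location": "secret"}), [foo, bar])

# Fully specified: only the named component gets it.
broadcast(Intent("pkgname.intent.ACTION", {"location": "secret"},
                 component="pkgname.FooReceiver"), [foo, bar])

print(len(foo.inbox), len(bar.inbox))
```

Naming the target component is enough to keep the second message away from the eavesdropper in this model; the first message leaks because any application may register the same action string.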


5.4.3 Unprotected Broadcast Receivers 


Applications use broadcast receiver components to re-
ceive intent messages. Broadcast receivers that define “intent
filters” to subscribe to specific event types are public. If
the receiver is not protected by a permission, a malicious 
application can forge messages. 


Finding 24 - Few applications are vulnerable to forg- 
ing attacks to dynamic broadcast receivers. We found 
406 unprotected broadcast receivers in 154 applications 
(14%). We found a large number of receivers sub-
scribed to system-defined intent types. These receivers
are indirectly protected by Android’s “protected broad- 
casts” introduced to eliminate forging. We found one 
application with an unprotected broadcast receiver for a 
custom intent type; however it appears to have limited 
impact. Additional sampling may uncover more cases. 


5.4.4 Intent Injection Attacks 


Intent messages are also used to start activity and service 
components. An intent injection attack occurs if the in-
tent address is derived from untrusted input.

We found 10 data flows from the network to an in-
tent address in 1 application. We could not confirm
the data flow and classify it as a false positive. The data
flow sink exists in a class named ProgressBroadcasting-
FileInputStream. No decompiled code references this
class, and all data flow sources are calls to URLCon-
nection.getInputStream(), which is used to create Input-
StreamReader objects. We believe the false positive re-
sults from the program analysis modeling of classes ex-
tending InputStream.

We found 80 data flows from IPC to an intent address 
in 37 applications. We classified the data flows by the 
sink: the Intent constructor is the sink for 13 applica- 
tions; setAction() is the sink for 16 applications; and set- 
Component() is the sink for 8 applications. These sets 
are disjoint. Of the 37 applications, we found that 17 
applications set the target component class explicitly (all 
except 3 use the setAction() data flow sink), e.g., to relay 
the action string from a broadcast receiver to a service. 
We also found four false positives due to our assumption 
that all Intent objects come from IPC (a few exceptions 
exist). For the remaining 16 cases, we observe: 


Finding 25 - Some applications define intent addresses 
based on IPC input. Three applications use IPC input 
strings to specify the package and component names for 
the setComponent() data flow sink. Similarly, one appli- 
cation uses the IPC “extras” input to specify an action to 
an Intent constructor. Two additional applications start 
an activity based on the action string returned as a result 
from a previously started activity. However, to exploit 
this vulnerability, the applications must first start a ma- 
licious activity. In the remaining cases, the action string 
used to start a component is copied directly into a new 
intent object. A malicious application can exploit this 
vulnerability by specifying the vulnerable component’s 
name directly and controlling the action string. 


5.4.5 Delegating Control 


Applications can delegate actions to other applications 
using a “pending intent.” An application first creates an
intent message as if it were performing the action. It then
creates a reference to the intent based on the target com- 
ponent type (restricting how it can be used). The pend- 
ing intent recipient cannot change values, but it can fill in 
missing fields. Therefore, if the intent address is unspec- 
ified, the remote application can redirect an action that is 
performed with the original application’s permissions. 
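
The fill-in rule described above can be illustrated with a small model. This is a sketch under our own assumptions: `PendingIntentModel` is a hypothetical class, not Android's PendingIntent, and it tracks only two intent fields.

```python
class PendingIntentModel:
    """Toy handle to an intent that runs with its creator's permissions."""

    def __init__(self, creator, action=None, component=None):
        self.creator = creator
        self.fields = {"action": action, "component": component}

    def send(self, fill_in):
        """The recipient may fill in missing fields but not change set ones.

        Returns the identity the action runs as and the final intent fields.
        """
        final = dict(self.fields)
        for key, value in fill_in.items():
            if key in final and final[key] is None:
                final[key] = value
        return self.creator, final


# Address fully specified by the creator: nothing left to redirect.
safe = PendingIntentModel("victim.app", action="VIEW",
                          component="victim.app.Main")

# Address left unspecified: the recipient chooses the target.
unsafe = PendingIntentModel("victim.app", action="VIEW")

attacker_fill = {"component": "evil.app.Target", "action": "DELETE"}
print(safe.send(attacker_fill))
print(unsafe.send(attacker_fill))
```

In both cases the action string set by the creator survives, but only the pending intent without a component can be redirected, and the redirected action still runs as `victim.app`.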


Finding 26 - Few applications unsafely delegate actions. 
We found 300 unsafe pending intent objects in 116 appli- 
cations (10.5%). Sampling these applications, we found 
an overwhelming number of pending intents used for ei- 
ther: (1) Android’s UI notification service; (2) Android’s 
alarm service; or (3) communicating between a UI wid-
get and the main application. None of these cases allow
manipulation by a malicious application. We found two
applications that send unsafe pending intents via IPC.
However, exploiting these vulnerabilities appears to pro-
vide negligible adversarial advantage. We also note that
a more sophisticated analysis framework could be
used to eliminate the aforementioned false positives.


5.4.6 Null Checks on IPC Input 


Android applications frequently process information 
from intent messages received from other applications. 
Null dereferences cause an application to crash, and can 
thus be used as a denial of service.


Finding 27 - Applications frequently do not perform null 
checks on IPC input. We found 3,925 potential null 
dereferences on IPC input in 591 applications (53.7%). 
Most occur in classes for activity components (2,484 
dereferences in 481 applications). Null dereferences in 
activity components have minimal impact, as the appli- 
cation crash is obvious to the user. We found 746 poten- 
tial null dereferences in 230 applications within classes 
defining broadcast receiver components. Applications 
commonly use broadcast receivers to start background 
services; therefore, it is unclear what effect a null deref-
erence in a broadcast receiver will have. Finally, we
found 72 potential null dereferences in 36 applications 
within classes defining service components. Applica-
tion crashes corresponding to these null dereferences
have a higher probability of going unnoticed. The re- 
maining potential null dereferences are not easily associ- 
ated with a component type. 


5.4.7 SDcard Use 


Any application that has access to read or write data on 
the SDcard can read or write any other application’s data
on the SDcard. We found 657 references to the SDcard in 
251 applications (22.8%). Sampling these applications, 
we found a few unexpected uses. For example, the com/ 
tapjoy ad library (used by com.jnj.mocospace.android) 
determines the free space available on the SDcard. An- 
other application (com.rent) obtains a URL from a file 
named connRentInfo.dat at the root of the SDcard. 


5.4.8 JNI Use 


Applications can include functionality in native libraries 
using the Java Native Interface (JNI). As these methods
are not written in Java, they have inherent dangers. We 
found 2,762 calls to native methods in 69 applications 
(6.3%). Investigating the application package files, we 
found that 71 applications contain .so files. This indi- 
cates two applications with an .so file either do not call 
any native methods, or the code calling the native meth- 
ods was not decompiled. Across these 71 applications, 
we found 95 .so files, 82 of which have unique names.


6 Study Limitations


Our study was limited in three ways: a) the stud-
ied applications were selected with a bias towards popu-
larity; b) the program analysis tool cannot compute data
and control flows for IPC between components; and c) 
source code recovery failures interrupt data and control 
flows. Missing data and control flows may lead to false 
negatives. In addition to the recovery failures, the pro- 
gram analysis tool could not parse 8,042 classes, reduc- 
ing coverage to 91.34% of the classes. 

Additionally, a portion of the recovered source code 
was obfuscated before distribution. Code obfuscation 
significantly impedes manual inspection. It likely exists 
to protect intellectual property; Google suggests obfus- 
cation using ProGuard (proguard.sf.net) for applica- 
tions using its licensing service [23]. ProGuard protects 
against readability and does not obfuscate control flow. 
Therefore it has limited impact on program analysis. 

Many forms of obfuscated code are easily recogniz-
able: e.g., class, method, and field names are converted
to single letters, producing single letter Java filenames
(e.g., a.java). For a rough estimate on the use of obfus-
cation, we searched applications containing a.java. In
total, 396 of the 1,100 applications contain this file. As
discussed in Section 5.3, several advertisement and ana-
lytics libraries are obfuscated. To obtain a closer estimate
of the number of applications whose main code is obfus-
cated, we searched for a.java within a file path equiva-
lent to the package name (e.g., com/foo/appname for
com.foo.appname). Only 20 applications (1.8%) have
this obfuscation property, which is expected for free ap-
plications (as opposed to paid applications). However,
we stress that the a.java heuristic is not intended to be
a firm characterization of the percentage of obfuscated
code, but rather a means of acquiring insight.
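
The filename heuristic described above might be sketched as follows. This is an illustrative reconstruction, not the script used in the study: the function name `obfuscation_hints` is ours, paths are assumed to be relative to the recovered source root with "/" separators, and treating subdirectories of the package path as "within" it is our assumption.

```python
import posixpath


def obfuscation_hints(package_name, source_paths):
    """Return (any_a_java, main_code_a_java) for one application.

    any_a_java:       some recovered source file is named a.java
    main_code_a_java: an a.java lies under the directory that mirrors
                      the package name (com.foo.appname -> com/foo/appname)
    """
    package_dir = package_name.replace(".", "/")
    any_a_java = False
    main_code_a_java = False
    for path in source_paths:
        if posixpath.basename(path) != "a.java":
            continue
        any_a_java = True
        directory = posixpath.dirname(path)
        if directory == package_dir or directory.startswith(package_dir + "/"):
            main_code_a_java = True
    return any_a_java, main_code_a_java


# An application whose only obfuscated code is a bundled ad library.
print(obfuscation_hints("com.foo.appname",
                        ["com/admob/android/ads/a.java",
                         "com/foo/appname/Main.java"]))
```

Separating the two flags mirrors the distinction drawn in the text between applications that merely bundle an obfuscated library and applications whose own code is obfuscated.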


7 What This All Means 


Identifying a singular take-away from a broad study such 
as this is non-obvious. We come away from the study 
with two central thoughts: one having to do with the
study apparatus, and the other regarding the applications. 

ded and the program analysis specifications are en- 
abling technologies that open a new door for application 
certification. We found the approach rather effective de- 
spite existing limitations. In addition to further studies of 
this kind, we see the potential to integrate these tools into 
an application certification process. We leave such dis- 
cussions for future work, noting that such integration is 
challenging for both logistical and technical reasons [30]. 

On a technical level, we found the security character- 
istics of the top 1,100 free popular applications to be con- 
sistent with smaller studies (e.g., Enck et al. [14]). Our 
findings indicate an overwhelming concern for misuse of
privacy sensitive information such as phone identifiers
and location information. One might speculate this oc-
curs due to the difficulty in assigning malicious intent.

Arguably more important than identifying the exis-
tence of information misuse, our manual source code
inspection sheds more light on how information is mis- 
used. We found phone identifiers, e.g., phone number, 
IMEI, IMSI, and ICC-ID, were used for everything from 
“cookie-esque” tracking to account numbers. Our find- 
ings also support the existence of databases external to 
cellular providers that link identifiers such as the IMEI 
to personally identifiable information. 

Our analysis also identified significant penetration of 
ad and analytics libraries, occurring in 51% of the studied
applications. While this might not be surprising for free 
applications, the number of ad and analytics libraries in- 
cluded per application was unexpected. One application 
included as many as eight different libraries. It is unclear 
why an application needs more than one advertisement 
and one analytics library. 

From a vulnerability perspective, we found that many 
developers fail to take necessary security precautions. 
For example, sensitive information is frequently writ- 
ten to Android’s centralized logs, as well as occasionally 
broadcast to unprotected IPC. We also identified the po- 
tential for IPC injection attacks; however, no cases were 
readily exploitable. 

Finally, our study only characterized one edge of the 
application space. While we found no evidence of tele- 
phony misuse, background recording of audio or video, 
or abusive network connections, one might argue that 
such malicious functionality is less likely to occur in 
popular applications. We focused our study on popular 
applications to characterize those most frequently used. 
Future studies should take samples that span application 
popularity. However, even these samples may miss the 
existence of truly malicious applications. Future studies 
should also consider several additional attacks, including 
installing new applications [43], JNI execution [34], ad- 
dress book exfiltration, destruction of SDcard contents, 
and phishing [20]. 


8 Related Work 


Many tools and techniques have been designed to iden- 
tify security concerns in software. Software written in 
C is particularly susceptible to programming errors that 
result in vulnerabilities. Ashcraft and Engler [7] use 
compiler extensions to identify errors in range checks. 
MOPS [11] uses model checking to scale to large 
amounts of source code [42]. Java applications are in- 
herently safer than C applications and avoid simple vul- 
nerabilities such as buffer overflows. Ware and Fox [46] 
compare eight different open source and commercially 
available Java source code analysis tools, finding that
no one tool detects all vulnerabilities. Hovemeyer and
Pugh [22] study six popular Java applications and li- 
braries using FindBugs extended with additional checks. 
While the analysis included non-security bugs, the results
motivate a strong need for automated analysis by all de- 
velopers. Livshits and Lam [28] focus on Java-based 
Web applications. In the Web server environment, inputs 
are easily controlled by an adversary, and left unchecked 
can lead to SQL injection, cross-site scripting, HTTP re- 
sponse splitting, path traversal, and command injection. 
Felmetsger et al. [19] also study Java-based web applica- 
tions; they advance vulnerability analysis by providing 
automatic detection of application-specific logic errors. 

Spyware and privacy breaching software have also 
been studied. Kirda et al. [26] consider behavioral prop- 
erties of BHOs and toolbars. Egele et al. [13] target 
information leaks by browser-based spyware explicitly 
using dynamic taint analysis. Panorama [47] consid-
ers privacy-breaching malware in general using whole-
system, fine-grained taint tracking. Privacy Oracle [24]
uses differential black box fuzz testing to find privacy 
leaks in applications. 

On smartphones, TaintDroid [14] uses system-wide 
dynamic taint tracking to identify privacy leaks in An- 
droid applications. By using static analysis, we were able 
to study a far greater number of applications (1,100 vs. 
30). However, TaintDroid’s analysis confirms the exfil- 
tration of information, while our static analysis only con- 
firms the potential for it. Kirin [16] also uses static anal- 
ysis, but focuses on permissions and other application 
configuration data, whereas our study analyzes source 
code. Finally, PiOS [12] performs static analysis on iOS 
applications for the iPhone. The PiOS study found the 
majority of analyzed applications to leak the device ID 
and over half of the applications include advertisement 
and analytics libraries. 


9 Conclusions 


Smartphones are rapidly becoming a dominant comput-
ing platform. Low barriers of entry for application de-
velopers increase the security risk for end users. In this
paper, we described the ded decompiler for Android ap- 
plications and used decompiled source code to perform a 
breadth study of both dangerous functionality and vul- 
nerabilities. While our findings of exposure of phone 
identifiers and location are consistent with previous stud- 
ies, our analysis framework allows us to observe not only 
the existence of dangerous functionality, but also how it 
occurs within the context of the application. 

Moving forward, we foresee ded and our analysis 
specifications as enabling technologies that will open 
new doors for application certification. However, the in-
tegration of these technologies into an application certifi-
cation process requires overcoming logistical and techni-
cal challenges. Our future work will consider these chal-
lenges, and broaden our analysis to new areas, including
application installation, malicious JNI, and phishing.


Notes


¹The undx and dex2jar tools attempt to decompile .dex files, but
were non-functional at the time of this writing.

²Note that it is sufficient to find any type-exposing instruction for
a register assignment. Any code that could result in different types for
the same register would be illegal. If this were to occur, the primitive
type would be dependent on the path taken at run time, a clear violation
of Java’s type system.

³Fortunately, these dangerous applications are now nonfunc-
tional, as the imnet.us NS entry is
NS1.SUSPENDED-FOR.SPAM-AND-ABUSE.COM.


USENIX Association 


Permission Re-Delegation: Attacks and Defenses 


Adrienne Porter Felt 
apf@cs.berkeley.edu
University of California, Berkeley 


Helen J. Wang, Alexander Moshchuk 
{helenw, alexmos}@microsoft.com 
Microsoft Research 


Steven Hanna, Erika Chin 
{sch, emc}@cs.berkeley.edu
University of California, Berkeley 


Abstract 


Modern browsers and smartphone operating systems 
treat applications as mutually untrusting, potentially ma- 
licious principals. Applications are (1) isolated ex- 
cept for explicit IPC or inter-application communica- 
tion channels and (2) unprivileged by default, requir- 
ing user permission for additional privileges. Although 
inter-application communication supports useful collab- 
oration, it also introduces the risk of permission re- 
delegation. Permission re-delegation occurs when an ap- 
plication with permissions performs a privileged task for 
an application without permissions. This undermines the 
requirement that the user approve each application’s ac- 
cess to privileged devices and data. We discuss permis- 
sion re-delegation and demonstrate its risk by launching 
real-world attacks on Android system applications; sev- 
eral of the vulnerabilities have been confirmed as bugs. 

We discuss possible ways to address permission re- 
delegation and present IPC Inspection, a new OS mech- 
anism for defending against permission re-delegation. 
IPC Inspection prevents opportunities for permission re- 
delegation by reducing an application’s permissions after 
it receives communication from a less privileged applica- 
tion. We have implemented IPC Inspection for a browser 
and Android, and we show that it prevents the attacks we 
found in the Android system applications. 


1 Introduction 


Traditional multi-user operating systems like Windows 
and Linux associate privileges with user accounts. When 
a user installs an application, the application runs in the 
name of the user and inherits the user’s ability to access 
system resources (e.g., the camera). Browsers and smart- 
phone operating systems, however, have shifted to a fun- 
damentally new model where applications are treated as 
potentially malicious and mutually distrusting. Princi- 
pals receive few privileges by default and are isolated 
from one another except for communication through ex-
plicit IPC channels. Only the user can grant individual
applications permission to use devices and access user-
private data (e.g., location) through system APIs. Conse-
quently, each application has its own set of permissions,
as granted by the user. 

IPC in a system with per-application permissions leads 
to the threat of permission re-delegation. Permission 
re-delegation occurs when an application with a user- 
controlled permission makes an API call on behalf of 
a less privileged application without user involvement. 
The privileged application is referred to as a deputy, 
wielding authority on behalf of the user. The permis- 
sion system should prevent applications from accessing 
privileged system APIs without user consent, but permis- 
sion re-delegation circumvents this rule. This violates 
the user’s expectation of safety when interacting with 
unprivileged applications. Permission re-delegation is a 
special case of the confused deputy problem [23] where 
authority is given by the user’s permission. 

We demonstrate that permission re-delegation is a re- 
alistic threat with a case study on Android applications. 
Android features per-application permissions and IPC, 
which are the necessary conditions for permission re- 
delegation vulnerabilities in applications. More than a 
third of the 872 surveyed Android applications request 
permissions for sensitive resources and also expose pub- 
lic interfaces; they are therefore at risk of facilitating 
permission re-delegation. We find 15 permission re- 
delegation vulnerabilities in 5 core system applications. 

The threat of permission re-delegation is particularly 
important for web browsers, which are just beginning to 
add APIs that provide websites with access to devices 
and geolocation [32]. Additionally, an IPC primitive 
named postMessage has been widely deployed over 
the past few years, facilitating interaction between appli- 
cations. Although device access for web applications is 
not yet widespread in 2011, permission re-delegation at- 
tacks will be a concern for future web applications. Ad- 
dressing the problem of permission re-delegation prior 
to the full adoption of device APIs will be beneficial to 
future browser security. 

We consider possible defenses against permission
re-delegation attacks and propose IPC Inspection, a new
OS mechanism that reduces a deputy’s privileges after 
receiving communication from a less privileged applica- 
tion. Privilege reduction reflects that a deputy is under 
the influence of another application. Consequently, the 
permission system can deny a privileged API call from 
the deputy if any application in the chain of influence 
lacks the appropriate permission(s). We implement IPC 
Inspection for two different platforms: the Android oper- 
ating system and ServiceOS’s browser runtime [38]. Our 
Android implementation prevents all of the attacks we 
discovered in our case study. We evaluate the impact of 
IPC Inspection on applications and anticipate that most 
applications would require few changes to work with IPC 
Inspection, but some might require more permissions. 


Contributions. We demonstrate that permission re- 
delegation is a widespread threat in modern platforms 
with per-application permissions and IPC. We also pro- 
pose IPC Inspection, a new OS mechanism to defend 
against permission re-delegation. 


Outline. We define the problem of permission re- 
delegation in Section 2. We then cast the problem in the 
context of today’s web and smartphone platforms in Sec- 
tion 3. We describe our experience of discovering per- 
mission re-delegation vulnerabilities in Android in Sec- 
tion 4. We discuss possible defenses utilizing known 
techniques in Section 5 and then propose a detailed de- 
sign for IPC Inspection in Section 6. We describe our 
implementation experience on Android and ServiceOS 
in Section 7. In Section 8, we evaluate the effectiveness 
of IPC Inspection. We discuss related work in Section 9. 


2 Permission Re-Delegation 


Permission systems prevent applications from perform- 
ing actions that are not desired by the user. We are con- 
cerned with attacks on user-controlled resources, which 
are the resources guarded by permissions that are granted 
by the user. Devices like the camera and GPS are user- 
controlled resources, as are private data stores like lists 
of calendars and contacts. We do not consider attacks 
on resources not controlled by user-granted permissions, 
like memory or application-specific databases. 
Permission re-delegation occurs when an application 
with a permission performs a privileged task on behalf 
of an application without that permission. This is a con- 
fused deputy attack [23] or privilege escalation attack. In 
this scenario, the user delegates authority to the deputy 
by granting it a permission. The deputy defines a pub- 
lic interface that exposes some of its functionality. A 
malicious requester application lacks the permission that 
the deputy has. The requester invokes the deputy’s
interface, causing the deputy to issue a system API call.
The system will approve and execute the deputy’s API call
because the deputy has the required permission. The
requester has succeeded in causing the execution of an
API call that it could not have directly invoked (Figure 1).

[Figure 1 diagram. Labels: Requester, Notify(), Send Text, System API]

Figure 1: Permission re-delegation attack. The requester does
not have the text message permission, but the deputy does. The
deputy also defines a public interface with a Notify method,
which makes an API call requesting to send a text message.
When the requester calls the deputy’s Notify method, the
system API will send the text message because the deputy has
the necessary permissions. Consequently, the attack succeeds.
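The attack in Figure 1 can be illustrated with a minimal Python sketch. It is a toy model rather than Android code: the classes `SystemAPI` and `Deputy`, the application names, and the use of the string `SEND_SMS` for the text message permission are assumptions made for illustration. The sketch shows only that a permission check on the immediate caller cannot see the requester behind the deputy.

```python
class PermissionDenied(Exception):
    """Raised when a caller lacks the permission guarding a system API."""


class SystemAPI:
    """Toy permission system that checks only the immediate caller."""

    def __init__(self, grants):
        self.grants = grants   # app name -> set of user-granted permissions
        self.sent = []         # log of (caller, text) for delivered messages

    def send_text(self, caller, text):
        if "SEND_SMS" not in self.grants.get(caller, set()):
            raise PermissionDenied(f"{caller} lacks SEND_SMS")
        self.sent.append((caller, text))


class Deputy:
    """Privileged app whose public notify() entry point sends a text message."""

    name = "deputy"

    def __init__(self, system):
        self.system = system

    def notify(self, text):
        # The call is made under the deputy's own identity; the system never
        # learns that an unprivileged requester triggered it.
        self.system.send_text(self.name, text)


system = SystemAPI({"deputy": {"SEND_SMS"}, "requester": set()})
deputy = Deputy(system)

# A direct call by the requester is denied.
try:
    system.send_text("requester", "hello")
    direct_call_allowed = True
except PermissionDenied:
    direct_call_allowed = False

# The same effect is reachable through the deputy's public interface.
deputy.notify("hello")
```

In this model the message is logged as sent by the deputy even though the requester holds no permissions, which is the behavior a defense has to rule out.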

Permission re-delegation can occur in three ways. 
First, an application may accidentally expose internal 
functionality to other applications. Second, a “confused” 
deputy might intentionally expose functionality, but an 
attacker might invoke it in a surprising context [23]. 
Third, the developer might expose functionality with the 
goal of attenuating authority but implement the attenu- 
ation policy incorrectly or in a way that is inconsistent 
with the system policy. 

We cannot expect the deputy’s developer to “opt in” 
to extra security measures or implement system policies. 
The deputy is neither helpful nor harmful to system secu- 
rity. Most developers are not security experts, and they 
are not independently motivated to prevent permission 
re-delegation because the consequences of permission 
re-delegation do not affect the deputy itself. 

Although the deputy is trusted with some permissions, 
it is not trusted with all permissions. An application is 
trusted with precisely the set of permissions approved by 
the user, and the deputy and requester may have disjoint 
sets of dangerous permissions. A prevention mechanism 
cannot grant a deputy access to its requester’s permis- 
sions unless the deputy already has the permissions. 

We aim to equip the permission system with the abil- 
ity to deny API requests made by a deputy on behalf 
of an unprivileged requester. Such an access control 
mechanism prevents the requester from executing priv- 
ileged actions with side-effects or requesting sensitive 
data through another application. However, we do not 
address the problem of preventing a privileged applica- 
tion from sharing sensitive data that it has legitimately 
and independently obtained. We focus on protecting ac- 
cess integrity, whereas improper data sharing is a confi- 
dentiality problem. Preventing data leakage is a comple- 
mentary problem beyond the scope of our work. 


3 Scenarios


We discuss permission re-delegation as it applies to web 
browsers and smartphone operating systems. 


3.1 Web Applications 


Websites are mutually distrusting principals [37, 5, 3]. 
Today’s browsers isolate websites from one another un- 
der the Same Origin Policy, which states that code from 
one site cannot access another site’s content [33]. Tradi- 
tionally, websites have also been prevented from access- 
ing the user’s local resources. However, this is changing 
as web applications begin to exhibit rich functionality. 

Browsers are starting to offer APIs for accessing users’ 
local resources. For example, the HTML5 device el- 
ement [24] provides web applications with access to 
streaming audio and video data. The W3C Device APIs 
and Policy Working Group [35] is designing interfaces 
for contacts, calendars, messaging, cameras, and more. 
New versions of Firefox, Google Chrome, and Safari 
support preliminary versions of the HTML5 geolocation 
API [32]. Access to these APIs will be controlled by 
permissions that users grant to website origins. 

Web permission re-delegation can occur in two ways: 


1. New windows. In this scenario, a user unknow- 
ingly navigates to a malicious website. The ma- 
licious website opens a website with a permission 
in a new window. If the deputy website makes a 
privileged API call upon loading, then the malicious 
website has successfully mounted a permission re- 
delegation attack. This attack can be completely in- 
visible to the user if the malicious requester loads 
the deputy in an invisible child frame; alternately, it 
can be hidden by opening the deputy as a pop-under 
window beneath the active browser window. 


2. Messages. As in the previous scenario, a user un- 
knowingly navigates to a malicious website. The 
malicious website sends a message to a deputy 
website that offers services to other websites via 
postMessage, an asynchronous client-side mes- 
sage passing channel. Requesters can send mes- 
sages to new child frames or existing windows that 
are navigationally connected to the requester [25]. 


[Figure 2 screenshot. Recoverable prompt text: The website “http://merged.ca” would like to use your current location. Checkbox: Request permission only once every 24 hours. Button: Don’t Allow]

Figure 2: Safari 5 requests the user’s permission before
granting a website access to geolocation.

For example, consider a user who permanently grants a
website the ability to record live video of the user for
the purpose of streaming the video to a known web
address (like Qik). The video website automatically begins
recording and streaming as soon as it is loaded. Later,
the user visits a malicious website that loads the video
website in an invisible child frame; as a result, the
browser begins recording and streaming live video
without the user’s knowledge.

Major browser vendors have recognized the problem 
of permission re-delegation. Mozilla Firefox, Safari, and 
Google Chrome implemented the geolocation permis- 
sion with restrictions on child iframes. When a web- 
site is opened as a child iframe, it has no geolocation 
permission; the user is prompted for approval, even if 
the user has previously granted the website the geoloca- 
tion permission. This prevents permission re-delegation 
attacks using iframes, but does not prevent permission 
re-delegation attacks on top-level windows or attacks be- 
tween two iframes embedded in the same page. We also 
want to extend the defense mechanism beyond geoloca- 
tion to all future permission-controlled browser APIs. 
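
The child-iframe restriction described above can be summarized as a small decision function. This is a hedged sketch of the policy as the text states it, not code from any browser; the function name `geolocation_decision`, its parameters, and the origin `https://maps.example` are hypothetical.

```python
def geolocation_decision(origin, is_child_frame, stored_grants):
    """Return 'allow' or 'prompt' for a geolocation request from origin."""
    if is_child_frame:
        # A framed document never uses a stored grant; the user is asked again.
        return "prompt"
    return "allow" if origin in stored_grants else "prompt"


stored = {"https://maps.example"}   # origins the user has already approved

framed = geolocation_decision("https://maps.example", True, stored)
top_level = geolocation_decision("https://maps.example", False, stored)
```

Under this policy a deputy loaded in a frame is prompted, but the same deputy opened by an attacker as a top-level window (for example, a pop-under) still receives its stored grant, which is the gap noted in the text.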


3.2 Smartphone Applications 


Smartphone platforms like iOS, Android, and Windows 
Phone 7 support third-party application markets. The 
markets have a low cost of entry, and not all of the devel- 
opers are equally trustworthy. Consequently, smartphone 
operating systems treat applications as potentially mali- 
cious. Smartphone operating systems also provide APIs 
to phone devices (Bluetooth, camera, GPS, etc.) and the 
network. Upon user approval, smartphone operating sys- 
tems grant per-application access to these resources. 

This paper focuses on Android, although our work ap- 
plies to other smartphone operating systems with user- 
controlled application permissions and IPC. Android 
permissions are categorized into 3 security levels:
Normal, Dangerous, and Signature.¹ Normal permissions
protect API calls that could annoy but not harm the user
(e.g., SET_WALLPAPER); these do not require user
approval. Dangerous permissions let an application
perform harmful actions (e.g., RECORD_AUDIO). Signature
permissions regulate access to extremely dangerous
privileges, e.g., CLEAR_APP_USER_DATA. Malware with
Dangerous or Signature permissions can spy on users, 
delete data, and abuse their billing accounts [7, 27, 34]. 

Android applications may communicate with each 
other, which introduces opportunities for permission
re-delegation. Inter-process messages known as Intents are
used for communication. Applications can make four 
types of components publicly available:

• Services run in the background. They can be started
with Intents or “bound” for synchronous calls.

• Activities provide applications with user interfaces.
They can be started with Intents. An Activity can
optionally provide a return value that is delivered
when the user finishes interacting with the Activity.

• BroadcastReceivers handle broadcast Intents. They
run in the background. Sometimes they invoke a
Service or an Activity to handle the task.

• ContentProviders are local databases.

¹ We group SignatureOrSystem and Signature permissions together
since there is no significant distinction between the two categories.

[Figure 3 screenshot. Recoverable prompt text: This application has access to the following: Your location (coarse (network-based) location, fine (GPS) location); Network communication (full Internet access); Your personal information (read contact data); Storage (modify/delete SD card contents); Hardware controls (change your audio settings, take pictures); System tools (modify global system settings, prevent phone from sleeping, read system log). Buttons: OK, Cancel]

Figure 3: Android displays permissions for user approval
during the installation of an application from the Android Market.


Services and BroadcastReceivers are inviting targets for 
stealthy permission re-delegation attacks because they 
may not have a visible user interface. A developer can 
restrict access to a component by specifying in the man- 
ifest that requesters must have a given permission or dy- 
namically checking the caller’s identity at runtime. 


3.3 Granting Permissions


Current permission systems ask users to grant permis- 
sions in one of two ways: 


Time-of-Use. In a time-of-use permission system, users 
are prompted to approve or deny permission for a re- 
source when a privileged API call is made. Web browsers 
and Apple iOS use this type of permission for third-party 
applications. The permission may be granted perma- 
nently, for a period of time, or for a single use. Figure 2 
provides an illustration of Safari 5 asking a user to grant 
a time-of-use permission. 


Install-Time. In a system with install-time permissions, 
an application declares its permission requirements in a 
manifest file. The user is prompted to approve the per- 
missions during the installation process, and the appli- 
cation is only installed if the user grants permanent per- 
missions to the requested resources. In a strict install- 
time permission system, new permissions cannot be re- 
quested during runtime. Android and Windows Phone 7 
use install-time permissions. Figure 3 shows an example 
of an Android installation permission prompt. 

                         Public Service    Public Receiver
Dangerous permissions          84                269
Signature permissions          13                 34





Figure 4: We surveyed 872 Android applications and identified 
ones with both public components and notable permissions. 


4 Case Study: Attacks on Android 


We perform a case study on Android applications to 
demonstrate that permission re-delegation is a real-world 
concern. We find that many applications are at risk of 
containing a vulnerability, and we build example attacks 
using core system applications. 


4.1 At-Risk Applications 


An application is at risk of containing a permission re- 
delegation vulnerability if it both requests permissions 
and exposes a public interface. In particular, we are 
interested in stealthy attacks that can be conducted in 
the background. We examine a set of 872 applications, 
which is composed of the 16 core system applications 
that come pre-installed with Android 2.2, the 756 most 
popular free applications from the Android Market, and 
the 100 most popular paid Market applications. 

For each application, we parse its manifest to find 
its list of permissions, public Services, and public Re- 
ceivers. Services always run in the background, and 
Receivers might run in the background. By evaluating 
whether an application has both permissions and a Ser- 
vice/Receiver, we can identify at-risk applications. We 
discard public components that are protected by permis- 
sions in the manifest, as they may not be at risk. 
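
The manifest check described above can be sketched as follows. This is an illustrative approximation, not the tool used in the study. It assumes the manifest has already been decoded to plain XML, that a component without an explicit `android:exported` attribute is public exactly when it declares an intent filter (the default on Android versions of that era), and it ignores permissions declared on the `<application>` element. The function names and the sample manifest are hypothetical.

```python
import xml.etree.ElementTree as ET

ANDROID = "{http://schemas.android.com/apk/res/android}"


def is_public(component):
    """Public if exported explicitly, or by default through an intent filter."""
    exported = component.get(ANDROID + "exported")
    if exported is not None:
        return exported == "true"
    return component.find("intent-filter") is not None


def at_risk(manifest_xml):
    """Return (requested permissions, unprotected public Services/Receivers)."""
    root = ET.fromstring(manifest_xml.strip())
    permissions = {p.get(ANDROID + "name") for p in root.iter("uses-permission")}
    exposed = []
    app = root.find("application")
    if app is not None:
        for tag in ("service", "receiver"):
            for comp in app.iter(tag):
                # Components guarded by a manifest permission are discarded.
                if is_public(comp) and comp.get(ANDROID + "permission") is None:
                    exposed.append((tag, comp.get(ANDROID + "name")))
    return permissions, exposed


SAMPLE = """
<manifest xmlns:android="http://schemas.android.com/apk/res/android"
          package="com.example.demo">
  <uses-permission android:name="android.permission.RECORD_AUDIO"/>
  <application>
    <service android:name=".OpenService" android:exported="true"/>
    <service android:name=".GuardedService" android:exported="true"
             android:permission="com.example.demo.USE"/>
    <receiver android:name=".BootReceiver">
      <intent-filter>
        <action android:name="android.intent.action.BOOT_COMPLETED"/>
      </intent-filter>
    </receiver>
    <receiver android:name=".PrivateReceiver"/>
  </application>
</manifest>
"""

perms, exposed = at_risk(SAMPLE)
is_at_risk = bool(perms) and bool(exposed)
```

The sample application is flagged because it both requests a permission and exposes two unprotected components. As the following paragraph of the text notes, a dynamic check inside the component would not be visible to this kind of analysis.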

It is also possible for an application to perform a dy- 
namic check on its caller’s permissions, which would not 
be reflected in our manifest analysis. 9% of all appli- 
cations in our set perform dynamic permission checks; 
however, the permission checks are not often used to 
prevent external use of Services or Receivers. (They are 
typically used to protect Providers, which we do not con- 
sider, or by embedded advertising libraries to determine 
what operations they can perform.) We examined a set 
of 50 randomly selected applications with public compo- 
nents and found that only 1 application does so to protect 
a Receiver or Service. 

Figure 4 shows the highest-level permission of an ap- 
plication, and whether it also has an unprotected public 
component. Overall, 320 of the 872 applications (37%) 
have permissions and at least one type of public compo- 
nent. 11 of the 16 system applications are at risk. 


4.2 Vulnerabilities


We examine the at-risk system applications to identify 
stealthy permission re-delegation vulnerabilities. An ap- 
plication contains a vulnerability if there exists an ex- 
ploitable path in the application between a public entry 
point and a restricted system API call. We construct at- 
tacks using 15 vulnerabilities in 5 applications. These 
vulnerabilities are present on every Android 2.2 phone. 


Finding Attacks. To find vulnerabilities, we created a 
path-finding tool. We first disassemble the at-risk system 
applications using Dedexer [31]. Our tool then parses 
the disassembled applications to find method declara- 
tions and invocations; from the results, we build the call 
graphs of the applications. Searching the call graphs re- 
veals paths from public entry points to protected system 
API calls. This tool likely misses viable attack paths; our 
call graph analysis does not handle inheritance
relationships or detect flow through structures like callbacks.
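
The call graph search can be sketched as a breadth-first traversal. This is a minimal illustration under stated assumptions, not the authors' tool: the call graph is assumed to be available as a dictionary from caller to callees, and the component names `ToggleReceiver` and `InfoService` are hypothetical. Like the tool described above, it knows nothing about inheritance or callbacks.

```python
from collections import deque


def find_paths(call_graph, entry_points, protected_apis):
    """Find one shortest call path from each entry point to each reachable protected API."""
    paths = []
    for entry in entry_points:
        parent = {entry: None}          # also serves as the visited set
        queue = deque([entry])
        while queue:
            method = queue.popleft()
            if method in protected_apis:
                # Walk the parent links back to the entry point.
                path, node = [], method
                while node is not None:
                    path.append(node)
                    node = parent[node]
                paths.append(path[::-1])
                continue                # do not search past the protected call
            for callee in call_graph.get(method, ()):
                if callee not in parent:
                    parent[callee] = method
                    queue.append(callee)
    return paths


graph = {
    "ToggleReceiver.onReceive": ["ToggleReceiver.handle"],
    "ToggleReceiver.handle": ["WifiManager.setWifiEnabled", "Log.d"],
    "InfoService.onStartCommand": ["Log.d"],
}
paths = find_paths(
    graph,
    ["ToggleReceiver.onReceive", "InfoService.onStartCommand"],
    {"WifiManager.setWifiEnabled"},
)
```

Each reported path is only a candidate; as the text goes on to explain, a path counts as a vulnerability only after a test case confirms that it is exploitable.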

We identify valid attacks by building test cases for the 
paths produced by our path-finding tool. We only con- 
sider attacks on API calls with verifiable side effects, so 
that we can tell if an attack succeeds. We are also only 
able to look for attacks on a subset of the API because 
Android permission documentation is incomplete. 


Results. We built attacks using 5 of the 16 system ap- 
plications. All of the attacks succeed in the background 
with no visible indication to the user. These 5 applica- 
tions provide 15 paths to 13 interfaces with permissions, 
one of which is SignatureOrSystem and nine of which 
are Dangerous. We present two example permission re- 
delegation attacks: 


Settings serves as the phone’s primary control panel. 
When a user presses certain buttons, Settings’ user in- 
terface sends a message to a Settings BroadcastReceiver 
that turns WiFi, Bluetooth, and GPS location track- 
ing on or off. However, Settings’ BroadcastReceiver 
accepts Intents from any application. An unprivi- 
leged application can therefore ask the Settings ap- 
plication to toggle the state of devices by sending 
its BroadcastReceiver the same Intents that it ex- 
pects from its user interface. Turning these devices 
on or off is supposed to require Dangerous permis- 
sions (CHANGE_WIFI_STATE, BLUETOOTH_ADMIN, 
and ACCESS_FINE_LOCATION, respectively).


Desk Clock provides time and alarm functionality. One
of its public Services accepts directions for playing 
alarms. If an unprivileged application sends an Intent re- 
questing an alarm with no end time, then Desk Clock will 
indefinitely vibrate the phone, play an alarm, and prevent 
the phone from sleeping. The alarm will continue until
the user kills the Desk Clock process. Playing sound
with a wake lock requires the Dangerous WAKE_LOCK 
permission, and vibrating the phone requires the Normal 
VIBRATE permission. 








We found concrete vulnerabilities in 5 of the 16
applications, which amounts to nearly half of the 11 system
applications that we identified as at-risk. It is likely that the
other at-risk system applications also contain vulnerabil- 
ities that we did not uncover. (Our call graph tool is lim- 
ited, as discussed above.) We have notified the Android 
team; several of the vulnerabilities have been confirmed 
and fixed [16, 15, 17, 18]. 


5 Defense Discussion 


We present our requirements for a permission re- 
delegation defense mechanism and consider whether ex- 
isting techniques can satisfy these goals. 


5.1 Goals 


Our goals for a successful permission re-delegation de- 
fense mechanism are: 


1. Preventing permission re-delegation: Applications 
should not be able to re-delegate permissions for 
user-controlled resources. 


2. Runtime independence: We want the solution to be 
language- and runtime-independent. This is advan- 
tageous because many platforms support applica- 
tions that are written in different programming lan- 
guages and run on different runtimes. 


3. Developer independence: Given evidence that de- 
velopers are unmotivated to proactively prevent per- 
mission re-delegation (Section 4.2), a solution can- 
not rely on developer diligence for security. 


4. Ease of development: The mechanism should not 
impose an excessive burden on developers; applica- 
tions should retain their functionality. 


5. Dynamic: The defense mechanism must work at 
runtime and not depend on application analysis. 
Client-side application analysis is not feasible for 
web applications because website code can change 
and encompass arbitrarily many documents. 


Our focus is on controlling access to resources; it is 
not our goal to protect data privacy. We wish to prevent a 
privileged piece of data from being accessed, but we do 
not aim to prevent it from being shared once it has been 
accessed. Protecting data privacy can be achieved with 
complementary solutions like TaintDroid [12]. 


5.2 Potential Defenses


Capabilities. A capability is an unforgeable, shareable 
token that, when used, grants access to a privilege [23]. 
To prevent permission re-delegation, a deputy could ask 
its requester to provide a capability and use it to make 
system API calls. However, this approach does not meet 
our goal of developer independence; a poorly written 
deputy could use its own capabilities rather than its re- 
quester’s when making system API calls. 

Alternately, a system could control access to privi- 
leges by requiring approval before granting an applica- 
tion the ability to communicate with a deputy. This 
would require static analysis of the deputies installed on 
the client to understand what user-relevant privileges the 
communication would involve. Following the example 
in Figure 1, an application that wants to use the “Notify” 
deputy would need to be authorized to use the underlying 
“Send Text” authority. Static capability provisioning is a 
topic for further exploration, but it can only be applied to 
platforms where static analysis of deputies is possible. 


Taint Tracking. In a taint tracking-based solution, the 
requester’s data could be the source of taint, and the taint 
could propagate as the requester’s data interacts with the 
deputy. If a deputy makes a privileged API call and the 
call is tainted, then the system could become aware that 
permission re-delegation has occurred. Unfortunately, 
tracking both data flow and control flow in a runtime- 
independent manner incurs more than an order of mag- 
nitude of performance overhead [14]. Furthermore, taint 
tracking both data and control flow would likely lead to 
taint explosion. Taint explosion would make the system 
unnecessarily restrictive. A taint tracking-based solution 
would need to track both direct and indirect taint flows 
because not all permission re-delegation attacks are data- 
dependent. For example, not all API calls require param- 
eters, and some applications process IPC calls without 
reading the actual message. 


MAC. Mandatory access control (MAC) systems (e.g., 
[9, 20]) are centralized information flow control systems 
where the operating system enforces a fixed information 
flow control policy across integrity or confidentiality lev- 
els. Such systems mandate that no information can flow 
from low-integrity principals to high-integrity principals 
or from high-confidentiality to low-confidentiality prin- 
cipals. Our goal of preventing permission re-delegation 
can be cast as MAC to some extent: a permission set 
could be treated as an integrity label, and A would have a
lower integrity level than B if B has a permission that A
does not. We are then concerned about safe information
flow with respect to access to user-controlled resources. 
However, Android applications cannot be strictly
ordered because applications often do not fit the subset
relationship. When this happens, neither the deputy nor
the requester has a strictly higher integrity level. This 
presents an application functionality problem: when a 
requester initiates communication with a deputy, the re- 
quester cannot receive the response if it has a permission 
that the deputy lacks. Since Android applications com- 
monly have intersecting but non-subset permission sets, 
this represents a significant functionality problem. Sec- 
tion 8.2 describes examples of applications for which this 
would be prohibitively restrictive. 
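
The ordering problem can be made concrete with a short sketch that compares permission sets under the subset relation. The application names and their permission sets are hypothetical and chosen only to illustrate the incomparable case.

```python
def compare(a, b):
    """Compare two permission sets under the subset order."""
    if a == b:
        return "equal"
    if a < b:
        return "lower"         # a is a strict subset of b
    if a > b:
        return "higher"        # a is a strict superset of b
    return "incomparable"      # each set holds a permission the other lacks


browser = frozenset({"INTERNET"})
maps_app = frozenset({"ACCESS_FINE_LOCATION", "INTERNET"})
camera_app = frozenset({"CAMERA", "INTERNET"})

subset_case = compare(browser, maps_app)
overlap_case = compare(maps_app, camera_app)
```

The last pair intersects without either set containing the other, so no integrity level can be assigned that ranks one application strictly above the other.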


Stack Inspection. Stack inspection [19, 36] is used in 
Java Virtual Machines and the Common Language Run- 
time to prevent confused deputy attacks within a runtime. 
When a deputy makes a privileged API call, the system 
checks whether the call stack includes any unprivileged 
applications. Principals in the permission re-delegation 
threat model operate in separate runtimes; to adapt to 
this scenario, the runtime could annotate the bottom of 
the stack with the requester’s identity when delivering a 
message event or starting the application. 

Standard stack inspection has several shortcomings. 
First, the approach is dependent on the runtime for 
correctness and would need to be re-implemented re- 
peatedly for a system with multiple types of runtimes. 
Second, stack inspection cannot prevent permission re- 
delegation when the deputy’s API call is de-synchronized 
from the request because the requester does not appear 
on the call stack. For example, JavaScript is event-driven 
after the initial document loads, and each event has a dif- 
ferent stack; and Android applications often make inter- 
nal IPC calls (each of which resets the stack) in order to 
complete a single operation. 
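
The second shortcoming can be reproduced in a toy model. The sketch below is an assumption-laden illustration, not an implementation of any real runtime: `call_stack` stands in for a stack annotated with principals, `event_queue` for an event loop, and the permission name `SEND_SMS` and the application names are illustrative.

```python
from collections import deque

GRANTS = {"deputy": {"SEND_SMS"}, "requester": set()}
call_stack = []          # principals of the frames currently executing
event_queue = deque()    # deferred work, run later on a fresh stack
sent = []


def stack_allows(permission):
    """Stack inspection: every principal on the stack must hold the permission."""
    return all(permission in GRANTS[p] for p in call_stack)


def send_text(text):
    if not stack_allows("SEND_SMS"):
        raise PermissionError("a principal on the stack lacks SEND_SMS")
    sent.append(text)


def run_as(principal, fn, *args):
    """Run fn with the principal pushed onto the annotated stack."""
    call_stack.append(principal)
    try:
        return fn(*args)
    finally:
        call_stack.pop()


def deputy_sync(text):
    run_as("deputy", send_text, text)


def deputy_async(text):
    # The deputy defers the work; the API call will run without the requester's frame.
    run_as("deputy", event_queue.append, lambda: run_as("deputy", send_text, text))


# Synchronous request: the requester is on the stack, so the call is denied.
try:
    run_as("requester", deputy_sync, "sync")
    sync_blocked = False
except PermissionError:
    sync_blocked = True

# Asynchronous request: the requester's frame is gone when the event fires.
run_as("requester", deputy_async, "async")
while event_queue:
    event_queue.popleft()()
```

Only the deferred request reaches the API, because by the time the queued event runs the stack contains the deputy alone.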


HBAC. History-based access control (HBAC) reduces 
the permissions of trusted code after any interaction with 
untrusted code [1]. Like stack inspection, HBAC relies 
on runtime mechanisms and does not achieve our goal of 
runtime independence. Like MAC, HBAC performs per- 
mission reduction upon receipt of return values, which 
places constraints on application functionality. 


6 IPC Inspection 


We build upon existing techniques to propose a new de- 
fense for permission re-delegation. We track information 
flow through inter-application messages, but not within 
an application. Our solution is similar to stack inspec- 
tion or HBAC, but modified to address their limitations. 
Like HBAC, we perform privilege reduction following 
communication, but we apply our mechanism at the OS 
level rather than as part of a runtime. 


6.1 Our Design


We propose IPC Inspection. When an application re- 
ceives a message from another application, we reduce 
the privileges of the recipient to the intersection of the 
recipient’s and requester’s permissions. We consider an 
application to be acting as a deputy on behalf of a re- 
quester once it has received communication from the re- 
quester. The deputy’s current set of permissions captures 
the communication history of the deputy and other appli- 
cations. Privilege reduction does not remove privileges 
that are not controlled by the user. 

IPC Inspection carries the same semantics as stack in- 
spection, but we generalize method invocations to IPC 
calls and externalize intra-application asynchronies like 
message queues. IPC Inspection is also runtime- and 
language-independent. 


Basic Rules. IPC Inspection is comprised of three pri- 
mary mechanisms. First, we maintain a list of current 
permissions for each application. Second, we build priv- 
ilege reduction into the system’s inter-application com- 
munication mechanisms. Starting an application and 
sending an explicit message both count as IPC. Third, 
we allow a receiving application to accept or reject mes- 
sages. Applications can limit who they receive messages 
from by registering a list of acceptable requesters. De- 
pending on the platform, acceptable requesters can be 
identified individually (e.g., by domain) or based on their 
set of permissions (i.e., any application with permission 
Y). This prevents privilege reduction from being abused 
as a denial of service mechanism. 

More precisely, we define four basic rules to govern 
access rights and privilege reduction. We write $A \to B$
to indicate that application A sends a message to
application B, and let $P^t(A)$ denote the set of permissions
held by application A at time t. The rules follow:


1. Initial state: $P^0(A) = P^{\text{original}}(A)$. When an
application starts running, it begins with the permissions
that were granted by the user.

2. Privilege reduction for recipient: If $R \to D$ at time
t, then $P^t(D) = P^{t-1}(D) \cap P^{t-1}(R)$. When an
application receives a message, its permissions are
reduced to the intersection of its and the sender’s
current permissions.

3. Sender’s permissions remain unchanged: If $R \to D$
at time t, then $P^t(R) = P^{t-1}(R)$.
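
The rules above can be sketched in a few lines of Python. This is a minimal model under simplifying assumptions, not the Android or ServiceOS implementation: every permission is treated as user-controlled, permissions are plain strings, and the class name `IPCInspector` and the application names are hypothetical.

```python
class IPCInspector:
    """Tracks each application's current permission set."""

    def __init__(self, granted):
        self.original = {app: frozenset(p) for app, p in granted.items()}
        # Rule 1: an application starts with the permissions granted by the user.
        self.current = dict(self.original)

    def send(self, sender, recipient):
        """Deliver a simplex message from sender to recipient."""
        # Rule 2: the recipient keeps only permissions the sender also holds.
        self.current[recipient] = self.current[recipient] & self.current[sender]
        # Rule 3: the sender's entry is deliberately left untouched.

    def check(self, app, permission):
        """Check performed when app makes a privileged API call."""
        return permission in self.current[app]


ipc = IPCInspector({
    "deputy": {"SEND_SMS", "INTERNET"},
    "requester": {"INTERNET"},
})
before = ipc.check("deputy", "SEND_SMS")
ipc.send("requester", "deputy")
after = ipc.check("deputy", "SEND_SMS")
```

After the message is delivered, the deputy can no longer send a text message on the requester's behalf, while permissions that both applications hold remain usable.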


Several properties of privilege reduction follow from 
the access rights rules: 


• Transitivity. An application’s current permissions
reflect the permissions of all of the applications in
a chain of communication. If $R_1 \to R_2$ at time
t, and $R_2 \to D$ at time t + 1, then
$P^{t+1}(D) = P^{t}(D) \cap P^{t-1}(R_2) \cap P^{t-1}(R_1)$.

• Additivity. If an application receives messages
from multiple applications, then its permissions
will be repeatedly reduced:
$P^t(D) = P^0(D) \cap \bigcap_{i=1}^{t} P^i(R_i)$,
where $R_i \to D$ at each time i.

• Bounds. An application’s current permissions can
never exceed its original permissions (i.e.,
$P^i(A) \subseteq P^0(A), \forall i$); there is no mechanism
for increasing permissions.
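
As a worked step, the transitivity property can be derived by applying the privilege reduction rule twice, once for each message in the chain. The derivation below uses only the notation defined with the basic rules, writing $\to$ for "sends a message to"; it is offered as a check on the property rather than as a statement taken from the original text.

```latex
\begin{align*}
P^{t}(R_2)   &= P^{t-1}(R_2) \cap P^{t-1}(R_1)
  && \text{(rule 2, } R_1 \to R_2 \text{ at time } t\text{)} \\
P^{t+1}(D)   &= P^{t}(D) \cap P^{t}(R_2)
  && \text{(rule 2, } R_2 \to D \text{ at time } t+1\text{)} \\
             &= P^{t}(D) \cap P^{t-1}(R_2) \cap P^{t-1}(R_1)
  && \text{(substituting the first line)}
\end{align*}
```

The bounds property follows in the same way: every step intersects the current set with another set, so the result can only shrink.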


Privilege reduction requires the platform to compute 
the intersection of two permissions, and this intersection 
function will differ by permission scheme. For example, 
permissions can be hierarchical, temporal, or monetary 
(as in $5 for text messages). When calculating the inter- 
section, the lesser value needs to be taken: the permission 
lower in the hierarchy, the lower monetary limit, or the 
shorter temporal permission. 
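
One way to picture such an intersection function is sketched below. The `Grant` record and its three fields are assumptions made for illustration; the text does not prescribe a representation. Each field is reduced to the lesser of the two values, as the paragraph above describes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    """One permission with illustrative hierarchy, monetary, and temporal limits."""
    level: int          # position in the hierarchy; a higher number means more access
    budget_cents: int   # spending cap, e.g. 500 for $5 of text messages
    expires_at: int     # expiry time in seconds since the epoch


def intersect_grant(a, b):
    """Take the lesser value on every dimension."""
    return Grant(
        min(a.level, b.level),
        min(a.budget_cents, b.budget_cents),
        min(a.expires_at, b.expires_at),
    )


def intersect(p, q):
    """Keep only permissions present in both maps, each at its lesser value."""
    return {name: intersect_grant(p[name], q[name]) for name in p.keys() & q.keys()}


deputy = {"SMS": Grant(2, 500, 2000), "GPS": Grant(1, 0, 3000)}
requester = {"SMS": Grant(1, 200, 5000)}
reduced = intersect(deputy, requester)
```

Here the deputy loses the GPS permission entirely and keeps the text message permission only at the requester's lower level and budget, with its own earlier expiry.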


Non-Simplex IPC. The basic rules apply to simplex (uni- 
directional) communication. Simplex communication 
implies a clear relationship: the requester sends a mes- 
sage or starts an application, and the deputy acts. How- 
ever, an operating system can offer other communication 
mechanisms. With request-reply IPC, the recipient of a 
message returns a value. For example, an Android Ac- 
tivity may return a result upon completion. The roles 
of deputy and requester remain clear with request-reply 
IPC, so IPC Inspection does not reduce the requester’s 
permissions when it delivers a reply. In contrast, duplex 
communication is a stream of data that flows between 
two applications (e.g., TCP sockets). Both applications 
can act as a deputy or requester during duplex IPC, so 
IPC Inspection reduces the privileges of both. Applica- 
tions can implement alternate protocols atop these three 
primitives, but the OS will not be aware of them. 
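
The treatment of the three primitives can be summarized in a short sketch, assuming set-valued permissions. The `IpcKind` enumeration and `apply_event` function are hypothetical names used only for this illustration.

```python
from enum import Enum
from typing import Dict, FrozenSet


class IpcKind(Enum):
    SIMPLEX = "simplex"
    REQUEST_REPLY = "request-reply"
    DUPLEX = "duplex"


def apply_event(
    kind: IpcKind,
    perms: Dict[str, FrozenSet[str]],
    requester: str,
    deputy: str,
) -> Dict[str, FrozenSet[str]]:
    """Return the permission table after one communication event."""
    common = perms[requester] & perms[deputy]
    updated = dict(perms)
    # The deputy is reduced for every kind of IPC.
    updated[deputy] = common
    # Only duplex IPC also reduces the initiating side, because either
    # party may act as a deputy. A request-reply reply is delivered
    # without reducing the requester.
    if kind is IpcKind.DUPLEX:
        updated[requester] = common
    return updated


before = {
    "R": frozenset({"INTERNET", "LOCATION"}),
    "D": frozenset({"INTERNET", "CAMERA"}),
}
after_reply = apply_event(IpcKind.REQUEST_REPLY, before, "R", "D")
after_duplex = apply_event(IpcKind.DUPLEX, before, "R", "D")
```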

IPC Inspection is similar in spirit to Mandatory Access 
Control, but IPC Inspection is less strict because it does 
not enforce privilege reduction in both directions after 
request-reply communication. This decision is in the in- 
terest of application functionality: highly-privileged re- 
questers would be unable to accept responses from less- 
privileged deputies without risking loss of privilege. Al- 
though there is a chance that a permission re-delegation 
attack could stem from the receipt of a return value, it is 
not a common case. Section 8.2 describes examples of 
applications that would not be able to function if return 
values prompted privilege reduction. 

HBAC also reduces privileges based on return values
but attempts to preserve application functionality by
providing authors with explicit rights restoration. In
HBAC, an application can validate a return value and then
request to have its removed privileges reinstated.
Although rights restoration solves the functionality
problem, it relies upon developers correctly and
non-spuriously restoring their permissions. Developers
could abuse this mechanism by restoring permissions in
every permission failure exception handler. Therefore, we
choose not to support explicit rights restoration in IPC
Inspection.


Application Instances. Deputies may need to simulta- 
neously interact with the user and multiple requesters. 
For example, the user might be interacting with an appli- 
cation when it receives a message from a less-privileged 
requester; the ensuing privilege reduction caused by the 
requester would interfere with the user’s experience. To 
prevent privilege reduction from impeding application 
functionality, we create new application instances to han- 
dle messages. All applications have a primary instance, 
which is the instance of the application that the user in- 
teracts with. When a requester asks the system to send 
a message to the deputy, the system automatically starts 
a new instance of the application. Multiple instances of 
the same application will run concurrently, in their own 
isolation units with their own current permissions. 

As a performance optimization for install-time permis- 
sion systems, it is not always necessary to create a new 
instance. The primary instance can be used for requests 
that do not prompt privilege reduction. In an install-time 
permission system, we know that privilege reduction is 
not necessary if the deputy already lacks permissions or 
the requester has a superset of the deputy’s permissions. 
Instance reuse is not possible with time-of-use permis- 
sion systems because the deputy could dynamically re- 
quest more permissions; it would not be clear which re- 
quester is responsible for the permission prompt. 
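
For an install-time permission system, the reuse condition reduces to a subset test: both cases named above (the deputy has no permissions, or the requester holds a superset of them) mean the intersection would leave the deputy unchanged. A minimal sketch, with a hypothetical function name:

```python
from typing import FrozenSet


def needs_new_instance(deputy: FrozenSet[str], requester: FrozenSet[str]) -> bool:
    """True if delivering the message would remove a deputy permission."""
    # deputy <= requester covers both cases: an empty deputy set is a
    # subset of anything, and a requester superset loses nothing.
    return not deputy <= requester


no_perms = frozenset()
camera = frozenset({"CAMERA"})
camera_net = frozenset({"CAMERA", "INTERNET"})
net = frozenset({"INTERNET"})
```

Here `needs_new_instance(camera, net)` is true, so the system would start a fresh instance, while `needs_new_instance(camera, camera_net)` is false and the primary instance can be reused.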

Some applications cannot exist in duplicate. Long- 
running background processes may have state that cannot 
be multiply instantiated. For this purpose, the system can 
let applications request to be singletons. All communi- 
cation events will be dispatched to the same instance for 
a singleton application, and a singleton application’s per- 
missions will be repeatedly reduced upon the delivery of 
each communication event until the application exits. 

Our altered version of Android automatically creates 
a new instance for every communication event unless a 
deputy asks to be a singleton; our browser implemen- 
tation creates a new instance whenever a requester pro- 
grammatically opens a new window, but not when mes- 
sages are sent to an existing window. The singleton de- 
sign pattern does not make sense from a web application 
perspective because websites already expect to be simul- 
taneously open in multiple windows in the same browser. 


Circumvention. A malicious deputy could circumvent 
the IPC Inspection rules: a developer could place in- 
formation about a request in storage and then perform 
the request later when the deputy regains full privileges. 
This does not violate any of our goals. We are not con- 
cerned with deputies maliciously sharing permissions; 
after all, a malicious deputy could directly abuse its 
permissions. Instead, we are concerned about benign
deputies thoughtlessly or accidentally giving away privi- 
leges to other applications. We expect that the pattern of 
saving requests in storage would be rare in practice. 


Permission re-delegation attacks could be mounted us- 
ing return values from request-reply IPC. In this attack, 
a privileged application sends a message to an unprivi- 
leged application. The unprivileged application returns 
a malicious value, which causes the privileged applica- 
tion to perform a malicious action. We do not defend 
against this attack in the interest of application function- 
ality. Our policy of disregarding return values is similar 
to the policy enforced by stack inspection. 


6.2 Platform Proposals 


We discuss the impact of IPC Inspection rules on permis- 
sions and communication, and we present proposals for 
how systems could add extra support for IPC Inspection. 


Permission Requests. IPC Inspection changes how 
users and developers interact with permissions. The im- 
pact of IPC Inspection depends on whether the system 
uses time-of-use permissions or install-time permissions. 
Time-of-use permission systems are the simpler case: 
in a time-of-use permission system like the browser, 
the user could be prompted whenever permission re- 
delegation is detected. For example, “Allow R and D to 
send text messages?” If the user answers affirmatively, 
the API call completes. As such, time-of-use permis- 
sion systems can accommodate IPC Inspection without 
changes to the developer experience. 


The relationship between IPC Inspection and install- 
time permission systems (e.g., Android) is more com- 
plex. If IPC Inspection is used with a pure install-time 
permission system, then an application’s developer must 
request all of the permissions used by the deputies that 
the application interacts with. First, this might be hard to 
determine. Second, this could lead to permission bloat: a 
requester must have all of the permissions needed by its 
deputy or deputies, even if the requester never individu- 
ally uses the permissions. To prevent this, we propose the 
relaxation of install-time permission requirements when 
IPC Inspection is applied to a platform. If permission 
re-delegation is detected, the platform could prompt the 
user to grant the requester temporary access to the priv- 
ilege via the deputy. If granted, the deputy’s permission 
would be restored. This would prevent requester applica- 
tions from needing to request permissions that they will 
not use independently from deputies. This change would 
require usability studies to determine whether it affects
user understanding of permissions. 


Request-Reply for the Web. Today’s websites exchange
messages using postMessage, a simplex communication
primitive. Websites that wish to use request-
reply semantics must construct the necessary support on 
top of postMessage; e.g., such support is built into 
the popular jQuery library. Unfortunately, the browser 
is unaware of such application-level semantics and must 
apply IPC Inspection rules to all postMessage recip- 
ients, which may conflate the requester and the deputy. 
When a requester receives a postMessage corre- 
sponding to a return value, the browser will treat it as 
a deputy and unnecessarily reduce its privileges. 

Current web standards lack a request-reply IPC 
primitive that would let browsers avoid this problem. 
We propose adding such a primitive by extending 
postMessage to take an optional callback argument. 
Replies would not trigger privilege reduction. 


Device Policies. Instance reuse is not possible for sys- 
tems with time-of-use permissions, as discussed in Sec- 
tion 6.1. However, it would be possible if the browser 
could identify sites that do not ask for any permis- 
sions. We propose that browsers only grant device ac- 
cess to web applications if they statically declare that 
they will ask for permissions. Web sites without permis- 
sions would never need to be multiply instantiated. This 
could be implemented, for example, with Content Secu- 
rity Policies, which already support developer-authored 
restrictions on what scripts, images, etc. can be loaded 
into a page [28]; a new rule would enable device access. 


7 Implementation 


We implemented two IPC Inspection prototypes: one for 
Android and one for ServiceOS’s browser runtime. 


7.1 Android 


We implemented a prototype of IPC Inspection as part of 
Android 2.2. We added support for IPC Inspection to the 
PackageManager and ActivityManager. The PackageM- 
anager installs applications, stores their permissions, and 
enforces permission requirements. The ActivityManager 
handles communication between applications and starts 
applications as necessary. 

Five events trigger privilege reduction: starting a Ser- 
vice, binding a Service, starting an Activity, receiving 
a Broadcast Intent, and requesting a ContentProvider. 
Our altered ActivityManager notifies the PackageMan- 
ager whenever any of these five communication events 
occurs. One notable exemption is the Launcher system 
application, which we allow to communicate with any 
other application freely, to prevent privilege reduction 
from occurring whenever the user launches applications. 


When the PackageManager is notified of a pending 
communication event, it checks whether privilege reduc- 
tion needs to occur. The message is dispatched normally, 
without privilege reduction, if (1) the message is from 
the system process, or (2) the requester has all of the tar- 
get’s permissions. Otherwise, privilege reduction of the 
target occurs before the message is delivered. 


The mechanics of privilege reduction differ based on 
whether the target application is a singleton. The Pack- 
ageManager reduces privileges of singleton applications 
by removing the appropriate permission(s) from the data 
structure that assigns permissions to application UIDs. 
An application can request to be a singleton by setting 
a singleton value in its manifest. For non-singleton 
applications, the PackageManager instructs the Activity- 
Manager to create a new instance of the application to 
receive the message. The ActivityManager places each 
new instance in a new process, with the same UID as 
the application’s primary instance. When the instance’s 
process is created, the PackageManager records the re- 
moved permissions in a data structure associated with the 
instance’s PID. Instances of an application have access 
to the same files because they share a UID. Android also 
uses UIDs as a security boundary between applications, 
but we assume instances are not trying to attack each 
other. For both singletons and non-singletons, the
PackageManager records which requester is responsible
for each removed permission in a blame map.


Permission enforcement in our modified version of 
Android occurs in two steps. First, the standard permis- 
sion enforcement mechanism checks whether the given 
permission is assigned to the application’s UID. This 
check will return the same result for all instances of an 
application, since permissions are associated with UIDs. 
If the standard permission check succeeds, then our al- 
tered PackageManager additionally checks whether the 
permission is in the process’s list of removed permis- 
sions for that instance. If it is, then access is denied. The 
blame map allows the PackageManager to identify the 
requester that is responsible for a permission failure. Fol- 
lowing our proposal in Section 6.2, the operating system 
could then ask the user to temporarily grant the permis- 
sion, although we did not implement this user interface. 
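
The two-step check for non-singleton instances can be sketched as follows. This is a simplified model, not the Android source: the class, its fields, the example UID and PID values, and the package name are all hypothetical, and permissions are modeled as sets of strings.

```python
from typing import Dict, Optional, Set, Tuple


class PermissionTable:
    """Toy model of per-UID permissions with per-instance removals."""

    def __init__(self) -> None:
        self.by_uid: Dict[int, Set[str]] = {}          # standard assignment
        self.removed_by_pid: Dict[int, Set[str]] = {}  # per-instance removals
        self.blame: Dict[Tuple[int, str], str] = {}    # (pid, perm) -> requester

    def reduce_instance(
        self, pid: int, uid: int, requester: str, requester_perms: Set[str]
    ) -> None:
        """Record the permissions a new instance loses to its requester."""
        removed = self.by_uid.get(uid, set()) - requester_perms
        self.removed_by_pid.setdefault(pid, set()).update(removed)
        for perm in removed:
            self.blame.setdefault((pid, perm), requester)

    def check(self, pid: int, uid: int, perm: str) -> Tuple[bool, Optional[str]]:
        """Return (allowed, requester to blame for a removal, if any)."""
        # Step 1: standard check against the UID; same for all instances.
        if perm not in self.by_uid.get(uid, set()):
            return False, None
        # Step 2: deny if the permission was removed from this instance.
        if perm in self.removed_by_pid.get(pid, set()):
            return False, self.blame.get((pid, perm))
        return True, None


table = PermissionTable()
table.by_uid[10001] = {"CAMERA", "INTERNET"}
table.reduce_instance(
    pid=42, uid=10001, requester="com.example.game", requester_perms={"INTERNET"}
)
```

In this model the primary instance (any PID without recorded removals) still passes the check for `CAMERA`, while the instance with PID 42 is denied and the blame map names the requester that a prompt could mention.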


We also extend the manifest file format so that 
deputies can limit incoming messages. The existing An- 
droid permission attribute lets an application specify 
a permission that a requester must have. We extend this 
to accept a set of permissions. If the developer limits 
the receipt of incoming messages to requesters with ad- 
equate permissions, then the application will never need 
to undergo privilege reduction. 


7.2 ServiceOS


To demonstrate IPC Inspection in the context of web 
applications, we implemented IPC Inspection as a per- 
mission manager for ServiceOS, a client platform that 
supports both web and desktop applications [38]. Our 
implementation enabled IPC Inspection for ServiceOS’s 
browser runtime. 

In ServiceOS, each web origin is a principal as defined 
by the Same Origin Policy [33]. Permissions for user- 
controlled resources are granted on a per-principal basis. 
When a user navigates to a website by following a link 
or entering a new URL into the location bar, the resulting 
window is associated with the appropriate site principal. 
The user is asked to grant or deny permissions for that 
origin. The permission manager keeps track of all current 
permissions and controls access to a mock device API 
that represents the new HTML5 APIs.

The browser’s communication system informs the per- 
mission manager of communication events. Two events 
trigger privilege reduction: 


PostMessages. When a website R.com sends a 
postMessage to a window belonging to D.com, this
communication passes through the IPC mechanism in 
the browser. Consequently, D.com’s permissions are 
subject to reduction. Permissions are restored when all 
windows belonging to a principal are closed. To prevent
denial of service (DoS), we provide websites with the ability to limit
which origins they receive postMessages from; mes- 
sages from other origins will be dropped by ServiceOS. 


New Windows. When a website R.com creates a new 
window (e.g., a child frame) belonging to D.com, we
treat this as a new service request; the parent window 
is the requester and the new window is the deputy. We 
consequently create a new principal for the new window, 
isolated from the rest of its origin. The new window’s 
privileges are immediately reduced. Unfortunately, we 
cannot provide a mechanism for limiting who a web site 
can be opened by; websites are typically opened by so 
many others that a whitelist is not realistic. 
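
The two triggering events can be modeled with a small sketch. It assumes set-valued permissions and does not model the restoration of permissions when a principal's windows close. The `Window` class, its `accept_from` field, and both functions are hypothetical names for illustration, not ServiceOS APIs.

```python
from typing import Optional, Set


class Window:
    """A browser window acting as its own principal instance."""

    def __init__(
        self, origin: str, perms: Set[str], accept_from: Optional[Set[str]] = None
    ) -> None:
        self.origin = origin
        self.perms = set(perms)
        # None means messages from any origin are accepted.
        self.accept_from = accept_from


def post_message(sender: Window, target: Window) -> bool:
    """Deliver a message; return False if the platform drops it."""
    if target.accept_from is not None and sender.origin not in target.accept_from:
        return False  # dropped, so no privilege reduction occurs
    target.perms &= sender.perms
    return True


def open_window(parent: Window, origin: str, origin_perms: Set[str]) -> Window:
    """Open a new window as a separate instance with reduced privileges."""
    return Window(origin, set(origin_perms) & parent.perms)


r = Window("R.com", set())
d_child = open_window(r, "D.com", {"SEND_SMS"})
d_picky = Window("D.com", {"SEND_SMS"}, accept_from={"trusted.example"})
dropped = not post_message(r, d_picky)
```

Here `d_child` starts with no permissions because R.com has none, and `d_picky` keeps `SEND_SMS` because the message from R.com is dropped before any reduction.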


As with the Android implementation, we record privilege
reduction so that a correct prompt could be displayed
to the user to explain the permission failure.
Additionally, this design could be re-implemented in
any major browser.








Permission level     Action   Data
Normal                   15     26
Dangerous                59     31
Signature/System         10      1
Total                    84     58


Figure 5: 142 Android API calls, classified.


8 Evaluation


We evaluate IPC Inspection with respect to security and 
ease of application development. 


8.1 Effectiveness 


Our primary goal is to prevent permission re-delegation. 
We evaluate IPC Inspection for Android security. 


Scope. IPC Inspection strengthens access control for 
user-controlled resources and prevents applications from 
making unauthorized API calls. Now, we evaluate the 
scope of protection on Android system APIs. 

We consider 142 methods from the Android API that 
are protected with permissions and classify each method 
as action or data calls. Action calls have side effects, and 
data calls return values without side effects. The set of 
142 methods includes all of the protected methods in the 
SDK documentation, plus additional protected methods 
we identified using randomized testing. There are likely 
more protected interfaces to be identified, but we believe 
that the set of 142 methods is a representative sample 
of the full set of protected interfaces. We classify each 
method according to its description in the documenta- 
tion. Accessors are typically data calls and mutators are 
typically action calls. 

IPC Inspection can prevent both action and data calls 
from being invoked when an application is acting un- 
der the influence of another, less-privileged application. 
Nevertheless, IPC Inspection does not provide privacy 
for the return results of data calls, if the API call was 
not made on behalf of the requester. For example, an 
application with privileged geolocation data may pass a 
cached location value to a less privileged application. We 
emphasize, however, that IPC Inspection does prevent an 
application from obtaining the data while under the in- 
fluence of another application. 

We measured the proportion of action and data calls
among the protected Android methods. Figure 5 shows the
results. Nearly 70% of the interfaces protected by
Dangerous and Signature/System permissions are action calls.
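
The "nearly 70%" figure can be checked against the counts in Figure 5 by dividing the Dangerous and Signature/System action calls by all calls at those two levels:

```latex
\[
\frac{59 + 10}{(59 + 31) + (10 + 1)} = \frac{69}{101} \approx 68.3\%
\]
```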


Attack Prevention. Our Android implementation pre- 
vents all of the permission re-delegation attacks de- 
scribed in Section 4. 

We suspect that many Android applications do not 
truly intend for their publicly invokable interfaces to be 
public; instead, the interfaces are intended for internal 
communication or messages from the operating system. 
Some messages that are typically sent by the operating 
system can also be sent by non-system applications; if 
the application does not additionally check the identity 
of its caller, it can be confused into performing an action. 
For example, we found in Section 4 that the Phone
application will mute indications of an incoming phone call
based on a message it expects to receive from the
system. Tests on 5 applications that appear to fall in this
category indicate that IPC Inspection prevents
unintentionally public interfaces from being unexpectedly invoked.

We built attack test suites for both Android and
ServiceOS, and our implementation prevents all of the test
attacks from succeeding. Each test suite comprises
a set of communications from an unprivileged
application to a privileged application. The privileged
application is set up to make an API call following the receipt of
any type of message. In a successful test, IPC Inspection
prevents the API call from completing. We exercise all
of the communication events available in each platform.
Figure 6 is a screenshot of a test attack in ServiceOS.


Figure 6: In this ServiceOS test attack, a “Game” application
with no permissions opens a “Dialer” application from a
different domain as a child frame. The user previously granted the
Dialer the ability to send text messages, but the permission is
removed upon loading because the Game lacks it. The Game
sends the Dialer a postMessage asking the Dialer to send a
text message. The Dialer’s API call is denied.


8.2 Ease of Development on Android 


Although we aim to prevent permission re-delegation, 
we do not want to prevent legitimate application interac- 
tions. We discuss four categories of applications: non- 
deputies, intentional deputies, unintentional deputies, 
and requesters. An application is an intentional deputy 
if it is built to expose functionality to other applications, 
whereas an application is an unintentional deputy if it 
exposes internal functionality accidentally. The Android 
communication system makes it easy for applications to 
accidentally expose internal functionality [6]. We esti- 
mate the prevalence of these four types of applications. 


Non-Deputies. An application that does not offer ser- 
vices to other applications will not be greatly impacted 
by IPC Inspection. A non-deputy application will not 
need to be multiply instantiated, nor will it experience 
privilege reduction. 


Unintentional Deputies. IPC Inspection will prevent
malicious applications from using accidentally public in- 
terfaces of unintentional deputies to launch permission 
re-delegation attacks. The application will not be af- 
fected during normal operation. 


Intentional Deputies. Applications that do provide pub- 
lic services can take one of two approaches, depending 
on their needs. The first option is that a deputy can accept 
calls from arbitrary requesters. A developer that chooses 
this option should place security exception handling code 
around API calls that require permissions, in case an un- 
privileged requester causes privilege reduction of an in- 
stance. The second option is for an application to re- 
quire that potential requesters have all of the permissions 
necessary to make the relevant API calls. A developer 
that chooses this option therefore needs to specify a list 
of required permissions. This choice obviates the need 
for multiple instances, which is beneficial from a perfor- 
mance perspective. However, it may reduce the num- 
ber of eligible requesters. A singleton application should 
choose this option to prevent its primary (and only) in- 
stance from experiencing privilege reduction. 
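
The two options can be sketched as follows. On Android a failed permission check surfaces as a `SecurityException`; the Python below only mimics that pattern, and `PermissionDenied`, `send_text`, `handle_request`, and `accepts` are hypothetical names introduced for illustration.

```python
from typing import Set


class PermissionDenied(Exception):
    """Stand-in for the platform's permission failure exception."""


def send_text(current_perms: Set[str], number: str) -> str:
    """A protected API call that fails if the instance lost SEND_SMS."""
    if "SEND_SMS" not in current_perms:
        raise PermissionDenied("SEND_SMS")
    return f"sent to {number}"


# Option 1: accept arbitrary requesters and handle permission failures.
def handle_request(current_perms: Set[str], number: str) -> str:
    try:
        return send_text(current_perms, number)
    except PermissionDenied as err:
        # Privilege reduction removed the permission; fail gracefully.
        return f"request declined: missing {err}"


# Option 2: only accept requesters that hold every required permission,
# so the deputy never undergoes privilege reduction.
def accepts(required: Set[str], requester_perms: Set[str]) -> bool:
    return required <= requester_perms


reduced_result = handle_request(set(), "555-0100")
full_result = handle_request({"SEND_SMS"}, "555-0100")
```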


Here, we discuss the impact of IPC Inspection on three 
popular, real-world intentional deputies: 


Barcode Scanner. The “ZXing” barcode scanner is 
among the 50 most popular free applications in the An- 
droid Market. It provides public interfaces for scanning, 
creating, and displaying barcodes. ZXing uses several 
permissions to complete these tasks (e.g., CAMERA). ZX- 
ing correctly attenuates authority by asking for user per- 
mission before performing privileged tasks, so IPC In- 
spection does not add security in this case. ZXing can be 
repeatedly instantiated without any apparent issues, so 
ZXing does not necessarily need to limit its requesters to 
applications with certain permissions. 





E-Mail. “GMail” is the official Google e-mail client. 
It provides several public interfaces. The primary pub- 
lic interface is an e-mail composition Activity that 
can be pre-seeded by the requester. GMail uses sev- 
eral permissions to send a pre-composed e-mail, e.g., 
WRITE_EXTERNAL_STORAGE (for uploading file at- 
tachments) and INTERNET. It is not clear whether all of 
GMail’s other public interfaces are truly intended to be 
public: for example, one BroadcastReceiver listens for 
a message that indicates login accounts have changed. 
Like ZXing, GMail can be repeatedly instantiated with- 
out exhibiting obvious flaws. 


























Music Player. “Music” is one of the pre-installed system
applications. It provides many public interfaces,
including a MediaPlaybackService. The MediaPlaybackService
opens music files, starts and stops music playback,
and manages the current playlist. It uses permissions
such as WRITE_EXTERNAL_STORAGE (to open files)
and WAKE_LOCK (to keep the phone on while playing
music). We discovered in Section 4 that permission
re-delegation attacks can be mounted using the
MediaPlaybackService, but they are prevented with our Android
IPC Inspection implementation. MediaPlaybackService
needs to run as a singleton because it is a long-running
background service that maintains state.








In summary, ZXing and GMail developers can choose 
whether to write exception handling code or lists of per- 
mission requirements for their requesters. The Music ap- 
plication is a singleton, so its developer should accept 
requests only from applications with all of its required 
permissions. The developer must specify that Music is a 
singleton and list the desired requester permissions. 


Requesters. Under IPC Inspection, deputies may require 
their requesters to have more permissions. We present 
three example requesters that make use of the deputies 
presented above and consider whether they already have 
the necessary permissions. Additional install-time per- 
missions would not be necessary if Android were to al- 
low time-of-use permissions for the specific case of in- 
teracting with deputies. 

We also consider the effects of a hypothetical rule that 
reduces privileges for requesters upon receipt of a reply 
value. Stricter policies (MAC and HBAC) include such 
a rule, as discussed in Section 6. 


Barcode Scanner. Many applications rely on the ZXing 
barcode scanner [41]. One example is “Beer Cloud,” 
which lets users find nearby bars that serve particular 
beers. Beer Cloud invokes the ZXing barcode scanner to 
identify beers. Under IPC Inspection, Beer Cloud would 
require the CAMERA permission to interact with ZXing. 
Currently, Beer Cloud does not have the CAMERA per- 
mission; IPC Inspection would require it to add it, which 
might overprivilege the Beer Cloud application. 

Once Beer Cloud has received the barcode data from 
ZXing, it passes the beer information and user location 
to a backend server. The server returns nearby bar ad- 
dresses. Beer Cloud uses Internet and location permis- 
sions to accomplish this. If we were to implement priv- 
ilege reduction following return values, then Beer Cloud 
would not be able to pass the beer and location data to its 
backend server because ZXing does not have the neces- 
sary location permission. 








E-Mail. “Blackmoon File Browser” relies on GMail for 
file sharing. Blackmoon File Browser only has one per- 
mission, WRITE_EXTERNAL_STORAGE. Under IPC In- 
spection, it would require the INTERNET permission as 
well. None of GMail’s public interfaces return values, so 
a hypothetical return value rule would not impact Black- 
moon File Browser or any other requesters of GMail. 


Intentional Deputy      5 applications
Unintentional Deputy    4 applications
Requester               6 applications


Figure 7: We classify 20 Android applications. 13 applications
are deputies or requesters. One is both an intentional and
unintentional deputy, and another acts as both a deputy and a
requester. All have Dangerous permissions.


Music Player. “ScrobbleDroid” uses the Music
application’s MediaPlaybackService to track the user’s recently
played songs. The recently played songs are then posted
on the website last.fm. Binding to the singleton
MediaPlaybackService would require ScrobbleDroid to add
four permissions under IPC Inspection.
MediaPlaybackService does not actually use all four of the extra
permissions (they are used elsewhere in Music), so this would
slightly over-privilege ScrobbleDroid. In the reverse
direction, ScrobbleDroid uses return values provided by
the MediaPlaybackService. Since ScrobbleDroid has
a permission that Music does not have, ScrobbleDroid
would be impacted by the hypothetical return result rule
that we rejected in Section 6.


In summary, Beer Cloud and Blackmoon File Browser 
would need to gain user approval for additional per- 
missions that make sense considering their functionality. 
However, they wouldn’t use the permissions for anything 
but communication with deputies. ScrobbleDroid would 
need otherwise unnecessary permissions because Music 
is a singleton. Beer Cloud and ScrobbleDroid also illus- 
trate why we do not reduce privileges for request-reply 
message exchanges. 


Prevalence. We consider 20 randomly selected Android 
applications (from our set of 872) and evaluate whether 
they act as deputies, requesters, or both. We manually 
interact with the applications’ user interfaces, log com- 
munication events, and examine their manifests. 

Figure 7 shows the results of the survey. Under IPC 
Inspection, developers of intentional deputies and re- 
questers may need to make minor changes to their ap- 
plications: intentional deputies might need to specify 
permission requirements for requesters, and requesters 
might need to add extra permissions. 11 of the 20 appli- 
cations are intentional deputies, requesters, or both. 

We also classify 4 of the 20 applications as uninten- 
tional deputies. IPC Inspection would prevent these acci- 
dentally public interfaces from being used for permission 
re-delegation attacks. The developer of the first uninten- 
tional deputy obviously copied part of the manifest from 
another application with public interfaces. The second 
unintentional deputy’s public interface accepts paths to 
local and remote files, which it then loads as an update to 
the application. It appears that the developer expects the 
files to be provided by the browser as part of an update 
mechanism from their website; in reality, any application
can supply the path to the file. The third crashes
when any of its public Activities are loaded by other ap- 
plications. The fourth has a public Activity that does not 
appear to be a useful addition to any other application. 


8.3 Ease of Web Development 


We discuss IPC Inspection from the perspective of a web 
developer and give examples of how it would be applied. 


Deputies. Web applications that want to accept
postMessages without risking privilege reduction 
should register a list of trusted requesters. It is already 
best practice to check the origin of message senders [29]; 
we make this logic explicit by providing a mechanism to 
register a list of acceptable requesters with the browser. 
Even if a message does cause a permission failure, web 
applications should already be built with the expectation 
that access to a device API might fail because users al- 
ready expect to continue interacting with websites after 
denying permissions. 

Any web application, regardless of whether it intends 
to act as a deputy, may be multiply instantiated because 
any website can be opened by another web site. Web 
applications already expect to be simultaneously open in 
multiple tabs in the same browser. 

IPC Inspection does impose one restriction on web
applications. If a.com opens the child frame b.com, and
b.com in turn opens a child frame a.com, we place
the two versions of a.com in separate instances because
they have different requesters.² The two instances of
a.com can obtain references to each other’s window
objects [25], which we support. However, the Same
Origin Policy implies that they should have full access to
each other’s DOM objects, but IPC Inspection disallows
this interaction because they are separate instances. We
are aware of only one legitimate use of this embedding 
pattern, which is to facilitate cross-origin communication 
between two sites in browsers that lack postMessage 
support. However, modern browsers that support device 
APIs will also support postMessage, obviating the 
need for the embedding. 


Requesters. Requesters do not need to make any 
changes to their applications because browser device per- 
missions are currently all time-of-use. When a deputy 
makes an API call on behalf of a requester, the browser 
will display a time-of-use prompt that asks the user to 
grant the permission to the deputy and requesters. 


² We do not break the case where a.com opens two child frames
from b.com; both will be placed in the same instance and will have 
access to each other’s heaps as expected. 


[Figure 8 diagram labels. Left: outer.com, goToUserLoc, map.com,
getCurrentPosition. Right: photos.com, camera.take().]


Figure 8: Left: an unprivileged website opens a mapping
service that uses the user’s location. Right: a privileged website
includes unprivileged advertisements.


Examples. Both postMessage and device APIs are 
too new for widespread support and use, so we cannot 
measure the impact of IPC Inspection on real-world ap- 
plications. Instead, we present two example cases of ap- 
plications interacting with each other (Figure 8). 

In the first example, a website (outer.com) opens a
mapping service. Outer.com has no permissions, so
the mapping service is opened as a new instance with no
permissions. When outer.com asks map.com to display
the user’s current position on the map, map.com
asks the browser for the current location. The browser
would then prompt the user to give outer.com and
map.com access to the current location.

In the second example, the user has granted camera 
access to a photo sharing website. The photo sharing 
website loads a frame containing advertisements. The 
ad site does not have any permissions, so it does not 
lose any permissions when it is opened. The photo 
sharing website can send messages to the advertising 
site without any changes to either party’s permissions. 
However, a postMessage from the advertising site
to photos.com would remove photos.com’s camera
permission. If photos.com were to use the camera
again after receiving a message from ads.com, the
user would be presented with a prompt asking to approve
camera access for both photos.com and ads.com. If
the photo sharing site wishes to avoid becoming a deputy, 
it should refuse postMessages from the advertise- 
ment. If it needs to receive replies from the ad, it can use 
our proposed request-reply variant of postMessage 
when communicating with the ad (Section 6.2). 
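
The photo-sharing example can be traced with a small model. This is a minimal sketch, not browser code: the Instance class and the post_message and use_api helpers are hypothetical names introduced here, and the sketch assumes the reduction rule described above, in which a receiver keeps only the permissions that its sender also holds.

```python
class Instance:
    """Hypothetical model of one application instance (not a browser API)."""

    def __init__(self, origin, granted=()):
        self.origin = origin
        self.granted = set(granted)   # permissions currently held
        self.requesters = set()       # origins whose messages reduced us


def post_message(sender, receiver):
    """Deliver a message; the receiver loses permissions the sender lacks."""
    lost = receiver.granted - sender.granted
    if lost:
        receiver.granted -= lost
        receiver.requesters.add(sender.origin)
        receiver.requesters |= sender.requesters


def use_api(instance, permission):
    """Return None if no prompt is needed, else the origins a prompt would name."""
    if permission in instance.granted:
        return None
    return sorted({instance.origin} | instance.requesters)


photos = Instance("photos.com", {"camera"})
ads = Instance("ads.com")

post_message(photos, ads)                 # no change for either party
before = use_api(photos, "camera")        # no prompt needed
post_message(ads, photos)                 # photos.com loses camera
after = use_api(photos, "camera")         # prompt names both origins
```

In this model a message from the more privileged photos.com to ads.com changes nothing, while a message in the other direction turns photos.com into a deputy whose next camera use would require approval for both origins.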


8.4 Performance 


The performance cost of IPC Inspection depends on the 
workload, i.e., the set of running applications. 


No Deputies. If the workload does not contain any per- 
mission re-delegation, then there is no cost. This oc- 
curs when applications don’t communicate, messages are 
only being sent to applications with no permissions, or 
the requesters have as many permissions as the deputies.


Singletons. A singleton is an application that is never
duplicated and has only one set of current permissions. 
Top-level windows in the browser are singletons, as are 
self-identified singleton applications in Android. If priv- 
ilege reduction applies to a singleton, then the cost of 
IPC Inspection is (1) removing permissions from a list or 
hash map and (2) adding the removed permissions to a 
hash map that records the reason for removal. Neither is 
an expensive operation. 


Instances. The primary cost of IPC Inspection occurs 
when privilege reduction requires the creation of a new 
instance. In a browser, new instances are created for 
child frames. The frame needs to be opened, regard- 
less of whether it is a new instance; the difference with 
IPC Inspection is that the browser gives the child frame a 
unique entry in the permission assignment map, separate 
from the main application. This is a small cost. 

In Android, the creation of a new instance might mean 
that multiple versions of the same application are run- 
ning simultaneously, in different processes and virtual 
machines. Given the battery and memory constraints of 
a mobile phone, this does not scale well. However, we 
do not expect many instances to be open simultaneously. 
The standard pattern for legitimate communication is as 
follows: (1) the requester opens a target Activity, (2) the 
user performs an action such as selecting a contact or ap- 
proving an e-mail, (3) the target Activity closes and the 
requester regains control of the screen. The instance only 
needs to exist while the Activity is open. Only one Activ- 
ity can be open at once, so we expect that in most cases 
only one additional instance would be open at a time. 


9 Related Work 


Browser Defenses. Major browser vendors remove the 
geolocation permission from iframes, so that the user 
must re-approve the geolocation permission for every 
parent-child window pair. This agrees with our proposal. 
However, we suggest that these rules also be extended to 
top-level windows that interact with each other. 


Android. Three pieces of concurrent work address sim- 
ilar issues. Davi et al. discuss permission re-delegation 
attacks on Android [8]. They introduce the problem and 
present an attack on a vulnerable deputy. We perform 
a larger analysis of applications and discuss how plat- 
forms need to change to prevent these attacks. Chin 
et al. present ComDroid [6], a static analysis tool that 
aims to help prevent developers from accidentally mak- 
ing components public. They also make recommen- 
dations for changes to the Android platform to reduce 
the rate of unintentional deputies. Although their tool 
and their platform recommendations would help prevent 
some instances of permission re-delegation, attacks on
intentional deputies would still remain. Dietz et al. built
Quire [10], an extension to the Android IPC mechanisms 
that helps developers avoid permission re-delegation at- 
tacks. Quire annotates IPCs so that an application can 
check the full chain of applications responsible for an 
IPC call. This addresses the same problem as IPC In- 
spection but does not force developer compliance. 

Past work has also discussed Android permission us- 
age. TaintDroid [12] performs dynamic taint analysis. It 
tracks the real-time flow of sensitive data through appli- 
cations to detect inappropriate sharing. The taint source 
is API data, and the network is the sink. They track 
only data flow, but not control flow. TaintDroid is com- 
plementary to IPC Inspection because they track API 
return values but do not prevent API calls from being 
made. Another tool, ScanDroid [21], uses static anal- 
ysis to determine data flow through Android applica- 
tions; it is intended for use similar to TaintDroid. Scan- 
Droid, however, requires access to application source 
code. Kirin [13] checks application permission require- 
ments and recommends against the installation of ap- 
plications with certain permission combinations. Their 
rules are intended to help detect malware, and they do 
not consider application interaction as a capability. 


HBAC. IPC Inspection revises and extends History- 
Based Access Control (HBAC) [1]. The two approaches 
share a core idea: application permissions are reduced af- 
ter inter-application interactions. HBAC is intended for 
use within a runtime; their permissions apply to threads, 
and privilege reduction follows function calls. We apply 
IPC Inspection to the application platform itself and cre- 
ate rules appropriate for that context. Our design places a 
high priority on application functionality and ease of de- 
velopment. Unlike HBAC, we do not reduce privileges 
following return values or permit explicit rights restora- 
tion. We introduce the concept of multiple instances of 
an application to prevent privilege reduction from im- 
pacting the application as a whole. 


Stack Inspection. IPC Inspection has the same seman- 
tics as stack inspection, but permission checks are asso- 
ciated with IPC rather than method calls. We do not de- 
pend on the stack, so event-driven code does not present 
a problem. We make message queues explicit and exter- 
nal, by servicing each message with a new application 
instance. IPC Inspection is also runtime- and language- 
independent. 


DIFC. Decentralized information flow control (DIFC) 
lets applications explicitly express their information flow 
policies to the operating system or a language runtime, 
which then enforces the policies [30, 26, 39, 11]. DIFC is 
not suitable for the problem of permission re-delegation 
because the access control policy for user-controlled
resources is centrally decided by the user, not applications.
IPC Inspection is more similar to centralized information
flow control, but IPC Inspection is deployed at the appli- 
cation level rather than the variable level. 


Low Watermark. IPC Inspection carries out the semantics
of Biba’s low watermark model [4] in that subjects
are application instances, and the integrity level of an ap- 
plication instance is determined by the permissions that 
the user has granted to the application instance. When 
Instance A sends an IPC message to Instance B, the mes- 
sage represents objects with the same integrity level as 
that of Instance A. If Integrity(A) < Integrity(B) (meaning
B contains permissions that A does not), then B
removes the permissions that A does not have.
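
The same rule can be written in set terms. The symbols P_A and P_B (the permission sets of Instance A and Instance B) and P_B' (B's permissions after receiving the message) are notation introduced here for illustration, and the comparison is read the way the text reads it:

```latex
\[
\mathrm{Integrity}(A) < \mathrm{Integrity}(B)
\;\iff\; P_B \setminus P_A \neq \emptyset,
\qquad
P_B' = P_B \cap P_A .
\]
```

When B holds no permission that A lacks, the intersection leaves P_B unchanged, so the single formula covers both cases.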

LOMAC [20] applies Biba’s low watermark model in a
different way. LOMAC aims to prevent (malicious) low 
integrity content from tampering with high integrity pro- 
gram execution, whereas IPC Inspection is intended to 
prevent less-privileged (low integrity) applications from 
using the additional privileges belonging to another (high 
integrity) application. In LOMAC, a subject is a job 
(which contains multiple application instances) and an 
object is data. Integrity levels for objects are assigned 
based on the sources for the objects. For example, Inter- 
net objects are at a lower integrity level than local data. 
Named pipes and shared memory are considered objects 
with integrity levels. Subjects’ integrity levels are as- 
signed based on the hierarchy of the jobs. The first set 
of system jobs have the highest integrity level and lower 
levels in the job hierarchy (as jobs spawn new jobs) represent
lower integrity levels. Both LOMAC and IPC Inspection
face the “self revocation problem” [20] inherent
in Biba’s low watermark model. The self revocation
problem occurs when a principal’s privilege reduction
prevents it from accessing high-integrity level data,
preventing legitimate functionality. Each scheme has to 
relax the low watermark model slightly to accommodate 
the problem. In our case, we ignore the reply in the non- 
simplex IPC communications. In the case of LOMAC, 
they use jobs as subjects rather than processes. 


CSRF. Like permission re-delegation, cross-site request 
forgery (CSRF) is a confused deputy attack that occurs 
in browsers [40]. However, CSRF attacks are targeted at 
server-side resources. CSRF defenses rely on developer 
participation and require changes to servers [2]. 


10 Conclusion 


We discuss permission re-delegation as a problem with 
new permission systems. Permission re-delegation oc- 
curs when a deputy delegates a user-controlled permis- 
sion to an unprivileged application without user autho- 
rization. This is an emerging threat for both the web and 
smartphone platforms. We find that many Android applications
are at risk of having permission re-delegation
vulnerabilities, and we construct attacks that exploit 15 
vulnerabilities in Android system applications. We dis- 
closed our findings and filed bug reports; several of the 
vulnerabilities have been confirmed as bugs. 

We also devise a runtime-independent defense mech- 
anism, IPC Inspection, which transparently protects 
against attacks on confused deputies, with no compatibil- 
ity cost for non-deputies or confused deputies. However, 
intentional deputies and their clients need some modifi- 
cations to work with IPC Inspection. In particular, appli- 
cations that interact with deputies may need to add per- 
missions that they otherwise do not use. We feel that 
the problem of permission re-delegation deserves careful 
attention, and we hope this paper will encourage future 
work on these problems. In particular, we believe static 
analysis of deputies is a promising future area for server-side
analysis or platforms with installed packages.


Appendix


The 16 pre-installed system applications referenced in 
Section 4 are: Browser, Calendar, Calculator, Camera, 
Contacts, Desk Clock, Email, Gallery, Global Search, 
Launcher, Live Wallpaper, Messaging/Mms, Music, 
Phone, Settings, and SoundRecorder. These pre-installed 
applications must be present on every phone, as set 
forth by the Android 2.2 compatibility definition [22]. 
We built permission re-delegation attacks using Settings, 
DeskClock, Phone, Music, and Launcher. We collected 
the applications from the Android Market on August 27, 
2010 (free) and October 15, 2010 (paid). 

The 20 applications surveyed in Section 8.2 are: Daum 
Maps, Pages Jaunes, Korean IME, Sherpa, Qik, Yan- 
dex Maps, First Aid, Three Stooges, Offi, Cheech and 
Chong, Coupons, Human Body Facts, Android System 
Info, Baidu Input, Bubbles, Hello Kitty Wallpaper, Mu- 
sical Lite, Time2Hunt Free, ModernInfo: BlackOps, and 
Wolfram Alpha.


Abstract

Smartphone apps are often granted the privilege to run
with access to the network and sensitive local resources. 
This makes it difficult for remote endpoints to place any 
trust in the provenance of network connections originat- 
ing from a user’s device. Even on the phone, different 
apps with distinct privilege sets can communicate with 
one another. This can allow one app to trick another 
into improperly exercising its privileges (resulting in a 
confused deputy attack). In Quire, we engineered two
new security mechanisms into Android to address these
issues. First, Quire tracks the call chain of on-device
IPCs, allowing an app the choice of operating with the
reduced privileges of its callers or exercising its full privilege
set by acting explicitly on its own behalf. Second,
a lightweight signature scheme allows any app to create 
a signed statement that can be verified by any app on 
the same phone. Both of these mechanisms are reflected 
in network RPCs. This allows remote systems visibility 
into the state of the phone when the RPC was made. We 
demonstrate the usefulness of Quire with two example
applications: an advertising service that runs advertise- 
ments separately from their hosting applications, and a 
remote payment system. We show that Quire’s perfor- 
mance overhead is minimal. 


1 Introduction 


On a smartphone, applications are typically given broad 
permissions to make network connections, access local 
data repositories, and issue requests to other apps on the 
device. For Apple’s iPhone, the only mechanism that 
protects users from malicious apps is the vetting pro- 
cess for an app to get into Apple’s app store. (Apple 
also has the ability to remotely delete apps, although it’s 
something of an emergency-only system.) However, any 
iPhone app might have its own security vulnerabilities, 
perhaps through a buffer overflow attack, which can give 
an attacker full access to the entire phone.

The Android platform, in contrast, has no significant
vetting process before an app is posted to the Android 
Market. Instead, the Android OS insulates apps from 
one another and the underlying Android runtime. Ap- 
plications from different authors run with different Unix 
user ids, containing the damage if an application is com- 
promised. (In this aspect, Android follows a design sim- 
ilar to SubOS [20].) However, this does nothing to de- 
fend a trusted app from being manipulated by a mali- 
cious app via IPC (i.e., a confused deputy attack [18], 
intent stealing/spoofing [9], or other privilege escalation 
attacks [11]). Likewise, there is no mechanism to prevent 
an IPC callee from misrepresenting the intentions of its 
caller to a third party. 

This mutual distrust arises in many mobile applica- 
tions. Consider the example of a mobile advertisement 
system. An application hosting an ad would rather the ad 
run in a distinct process, with its own user-id, so bugs in 
the ad system do not impact the hosting app. Similarly, 
the ad system might not trust its host to display the ad 
correctly, and must be concerned with hosts that try to 
generate fake clicks to inflate their ad revenue. 

To address these concerns, we introduce Quire, a low-overhead
security mechanism that provides important
context in the form of provenance and OS-managed data
security to local and remote apps communicating by IPC
and RPC, respectively. Quire uses two techniques to provide
security to communicating applications.

First, Quire transparently annotates IPCs occurring
within the phone such that the recipient of an IPC re- 
quest can observe the full call chain associated with the 
request. When an application wishes to make a network 
RPC, it might well connect to a raw network socket, but 
it would lack credentials that we can build into the OS, 
which can speak to the state of an RPC in a way that 
an app cannot forge. (This contextual information can 
be thought of as a generalization of the information pro- 
vided by the recent HTTP Origin header [2], used by web 
servers to help defeat cross-site request forgery (CSRF)
attacks.)

Second, Quire uses simple cryptographic mechanisms 
to protect data moving over IPC and RPC channels. 
Quire provides a mechanism for an app to tag an object 
with cheap message authentication codes, using keys that 
are shared with a trusted OS service. When data anno- 
tated in this manner moves off the device, the OS can 
verify the signature and speak to the integrity of the mes- 
sage in the RPC. 


Applications. Quire enables a variety of useful appli- 
cations. Consider the case of in-application advertising. 
A large number of free applications include advertise- 
ments from services like AdMob. AdMob is presently 
implemented as a library that runs in the same process 
as the application hosting the ad, creating trivial oppor- 
tunities for the application to spoof information to the 
server, such as claiming an ad is displayed when it isn’t, 
or claiming an ad was clicked when it wasn’t. In Quire,
the advertisement service runs as a separate application 
and interacts with the displaying app via IPC calls. The 
remote application’s server can now reliably distinguish 
RPC calls coming from its trusted agent, and can fur- 
ther distinguish legitimate clicks from forgeries, because 
every UI event is tagged with a Message Authentication 
Code (MAC) [21], for which the OS will vouch.
Consider also the case of payment services. Many 
smartphone apps would like a way to sell things, lever- 
aging payment services from PayPal, Google Checkout, 
and other such services. We would like to enable an ap- 
plication to send a payment request to a local payment 
agent, who can then pass the request on to its remote 
server. The payment agent must be concerned with the 
main app trying to issue fraudulent payment requests, so 
it needs to validate requests with the user. Similarly, the 
main app might be worried about the payment agent mis- 
behaving, so it wants to create unforgeable “purchase or- 
ders” which the payment app cannot corrupt. All of this 
can be easily accomplished with our new mechanisms. 


Challenges. For Quire to be successful, we must ac- 
complish a number of goals. Our design must be suffi- 
ciently general to capture a variety of use cases for aug- 
mented internal and remote communication. Toward that 
end, we build on many concepts from Taos [38], includ- 
ing its compound principals and logic of authentication 
(see Section 2). Our implementation must be fast. Ev- 
ery IPC call in the system must be annotated and must be 
subsequently verifiable without having a significant impact
on throughput, latency, or battery life. (Section 3 describes
Quire’s implementation, and Section 5 presents
our performance measurements.) Quire expands on related
work from a variety of fields, including existing
Android research, web security, distributed authentication
logics, and trusted platform measurements (see Section 6).
We expect Quire to serve as a platform for future
work in secure UI design, as a substrate for future research
in web browser engineering, and as a starting point
for a variety of applications (see Section 7).


2 Design 


Fundamentally, the design goal of Quire is to allow 
apps to reason about the call-chain and data provenance 
of requests, occurring on both a host platform via IPC 
or on a remote server via RPC, before committing to 
any security-relevant decisions. This design goal is 
shared by a variety of other systems, ranging from Java’s 
stack inspection [34, 35] to many newer systems that 
rely on data tainting or information flow control (see, 
e.g., [24, 25, 13]). In Quire, much like in stack inspection,
we wish to support legacy code without much, if
any, modification. However, unlike stack inspection, we
don’t want to modify the underlying system to annotate 
and track every method invocation, nor would we like to 
suffer the runtime costs of dynamic data tainting as in 
TaintDroid [13]. We also wish to operate correctly with 
apps that have natively compiled code, not just Java code 
(an issue with traditional stack inspection and with Taint- 
Droid). We observe that in order to accomplish these 
goals, we only need to track calls across IPC boundaries, 
which happen far less frequently than method invoca- 
tions, and which already must pay significant overheads 
for data marshaling, context switching, and copying. 

Stack inspection has the property that the available 
privileges at the end of a call chain represent the intersec- 
tion of the privileges of every app along the chain (more 
on this in Section 2.2), which is good for preventing con- 
fused deputy attacks, but doesn’t solve a variety of other 
problems, such as validating the integrity of individual 
data items as they are passed from one app to another or 
over the network. For that, we need semantics akin to 
digital signatures, but we need to be much more efficient 
as attaching digital signatures to all IPC calls would be 
too slow (more on this in Section 2.3). 


Versus information flow. A design that focuses on 
IPC boundaries is necessarily less precise than dynamic 
taint analysis, but it’s also incredibly flexible. We 
can avoid the need to annotate code with static secu- 
rity policies, as would be required in information flow- 
typed systems like Jif [26]. We similarly do not need 
to poly-instantiate services to ensure that each instance 
only handles a single security label as in systems like 
DStar/HiStar [39] or IPC Inspection [15]. Instead, in 
Quire, an application which handles requests from multiple
callers will pass along an object annotated with the
originator’s context when it makes downstream requests 
on behalf of the original caller. 


Likewise, where a dynamic tainting system like TaintDroid [13]
would generally allow a sensitive operation,
like learning the phone’s precise GPS location, to occur,
but would forbid it from flowing to an unprivileged app,
Quire will carry the unprivileged context through to the
point where the dangerous operation is about to happen,
and will then forbid the operation. An information flow
approach is thus more likely to catch corner cases (e.g., 
where an app caches location data, so no privileged call 
is ever performed), but is also more likely to have false 
positives (where it must conservatively err on the side of 
flagging a flow that is actually just fine). A programmer 
in an information flow system would need to tag these 
false positive corner cases as acceptable, whereas a pro- 
grammer using Quire would need to add additional se- 
curity checks to corner cases that would otherwise be al- 
lowed. 


2.1 Authentication logic and cryptography 


In order to reason about the semantics of Quire, we
need a formal model to express what the various operations
in Quire will do. Toward that end, we use the
Abadi et al. [1] (hereafter “ABLP”) logic of authentication,
as used in Taos [38]. In this logic, principals make
statements, which can include various forms of quotation
(“Alice says Bob says X”) and authorization (e.g., “Alice
says Bob speaks for Alice”). ABLP nicely models
the behavior of cryptographic operations, where crypto- 
graphic key material speaks for other principals, and we 
can use this model to reason about cross-process com- 
munication on a device as well as over the network. 

For the remainder of the current section, we will flesh 
out Quire’s IPC and RPC design in terms of ABLP and 
the cryptographic mechanisms we have adopted. 


2.2 IPC provenance 


Android IPC background. The application separa- 
tion that Android relies on to protect apps from one an- 
other has an interesting side effect; whenever two appli- 
cations wish to communicate they must do so via An- 
droid’s Binder IPC mechanism. All cross application 
communication occurs over these Binder IPC channels, 
from clicks delivered from the OS to an app to requests 
for sensitive resources like a user’s list of contacts or GPS
location. It is therefore critically important to protect 
these inter-application communication channels against 
attack.


[Figure 1 diagram labels: UID: 1, Call Chain: (); UID: 2, Call Chain: (1); TrustedMapper; Operating System, PrivilegeManager, Call chain: (1,2,3); 1: no GPS; 2: GPS okay; 3: GPS okay.]

Figure 1: Defeating confused deputy attacks.


Quire IPC design. The goal of Quire’s IPC prove- 
nance system is to allow endpoints that protect sensitive 
resources, like a user’s fine grained GPS data or contact 
information, to reason about the complete IPC call-chain 
of a request for the resource before granting access to the 
requesting app. 

Quire realizes this goal by modifying the Android IPC
layer to automatically build calling context as an IPC 
call-chain is formed. Consider a call-chain where three 
principals A, B, and C, are communicating. If A calls B 
who then calls C without keeping track of the call-stack, 
C only knows that B initiated a request to it, not that 
the call from A prompted B to make the call to C. This 
loss of context can have significant security implications 
in a system like Android where permissions are directly 
linked to the identity of the principal requesting access to 
a sensitive resource. 

To address this, Quire’s design is for any given callee 
to retain its caller’s call-chain and pass this to every 
downstream callee. The callee will automatically have 
its caller’s principal prepended to the ABLP statement. 
In our above scenario, C will receive a statement “B says
A says Ok”, where Ok is an abstract token representing
that the given resource is authorized to be used. It’s now 
the burden of C (or Quire’s privilege manager, operat- 
ing on C’s behalf) to prove Ok. As Wallach et al. [35] 
demonstrated, this is equivalent to validating that each 
principal in the calling chain is individually allowed to 
perform the action in question. 
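
The chain construction and the per-principal check can be illustrated with a short model. This is a minimal sketch under stated assumptions: the receive and authorized functions, the tuple representation of a statement, and the grants table are hypothetical and are not Quire's actual interfaces.

```python
def receive(caller, statement_from_caller=()):
    """Statement as seen by the callee: the caller's principal is prepended."""
    return (caller,) + tuple(statement_from_caller)


def authorized(statement, permission, grants):
    """Ok is provable only if every principal in the chain holds the permission."""
    return all(permission in grants.get(p, set()) for p in statement)


# A calls B, then B calls C and passes along what it heard from A.
heard_by_b = receive("A")                # models "A says Ok"
heard_by_c = receive("B", heard_by_b)    # models "B says A says Ok"

only_b_privileged = authorized(heard_by_c, "GPS", {"B": {"GPS"}})
both_privileged = authorized(heard_by_c, "GPS", {"A": {"GPS"}, "B": {"GPS"}})
```

Here C sees the full chain rather than only B, so a request that originated with an unprivileged A is denied even though B itself holds the permission.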


Confused and intentional deputies. The current Android
permission system ties an app’s permissions to the
unique user-id it is assigned at install time. The Android
system then resolves the user-id of an app requesting access
to a sensitive resource into a permission set that determines
if the app’s request for the resource will succeed.
This approach to permissions enables applications
that have permission to access a resource to act as both
intentional and confused deputies. The current Android
permission model assumes that all apps act as intentional
deputies; that is, they resolve and check the user-id and
permission set of a calling application that triggers the
callee app to issue a request for a sensitive resource before
issuing the request to the resource.

An app that protects a sensitive resource and blindly 
handles requests from callers to the protected resource
is said to be acting as a confused deputy because it is 
unaware that it is doing dangerous actions on behalf of 
a caller who doesn’t have the necessary permissions. In 
reality, app developers rarely intend to create a confused 
deputy; instead, they may simply fail to consider that a 
dangerous operation is in play, and thus fail to take any 
precautions. 

The goal of the IPC extensions in Quire is to provide
enough additional security context to prevent confused
deputy attacks while still enabling an application to act
as an intentional deputy if it chooses to do so. To defeat
confused deputy attacks, we simply check if any one of
the principals in the call chain is not privileged for the
action being taken; in these cases, permission is denied.
Figure 1 shows this in the context of an evil application,
lacking fine-grained location privileges, which is trying
to abuse the privileges of a trusted mapping program,
which happens to have that privilege. The mapping application,
never realizing that its helpful API might be a
security vulnerability, naively and automatically passes
the call chain along to the location service. The
location service then uses the call chain to prove (or disprove)
that the request for fine-grained location should be
allowed.

As with traditional stack inspection, there will be 
times that an app genuinely wishes to exercise a privilege,
regardless of its caller’s lack of the same privilege.
Stack inspection solves this with an enablePrivilege
primitive that, in the ABLP logic, simply doesn’t
pass along the caller’s call stack information. The callee, 
after privileges are enabled, gets only the immediate 
caller’s identity. (In the example of Figure 1, the trusted 
mapper would drop the evil app from the call chain, and 
the location provider would only hear that the trusted 
mapper application wishes to use the service.) 
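
The difference between the default behavior and an intentional deputy can be shown in the same style. This is a hedged sketch: outgoing_statement and its enable_privilege flag are hypothetical names chosen to mirror the enablePrivilege idea, the principal names follow the Figure 1 scenario, and none of this is Quire's actual API.

```python
def receive(caller, statement_from_caller=()):
    """Statement as seen by the callee: the caller's principal is prepended."""
    return (caller,) + tuple(statement_from_caller)


def authorized(statement, permission, grants):
    """Permission is granted only if every principal in the chain holds it."""
    return all(permission in grants.get(p, set()) for p in statement)


def outgoing_statement(heard, enable_privilege=False):
    """What a deputy forwards downstream: its callers' chain by default,
    or nothing when it chooses to act on its own behalf."""
    return () if enable_privilege else tuple(heard)


GRANTS = {"TrustedMapper": {"GPS"}}      # EvilApp holds no GPS permission
heard_by_mapper = receive("EvilApp")

# Default: the chain is passed along, so the location service sees EvilApp.
default_chain = receive("TrustedMapper", outgoing_statement(heard_by_mapper))

# Intentional deputy: the mapper drops its caller and speaks for itself.
own_behalf_chain = receive(
    "TrustedMapper", outgoing_statement(heard_by_mapper, enable_privilege=True)
)

default_allowed = authorized(default_chain, "GPS", GRANTS)
own_behalf_allowed = authorized(own_behalf_chain, "GPS", GRANTS)
```

Dropping the chain is an explicit choice by the deputy; in the default path the unprivileged caller stays visible and the request is denied.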

Our design is, in effect, an example of the “security 
passing style” transformation [35], where security be- 
liefs are passed explicitly as an IPC argument rather than 
passed implicitly as annotations on the call stack. One 
beneficial consequence of this is that a callee might well
save the statements made by its callers and reuse them at
a later time, for example if it queues requests for later processing,
in order to properly modulate the privilege level
of outgoing requests.


Security analysis. While apps, by default, will pass
along call chain information without modification, Quire
allows a caller to forge the identities of its antecedent 
callers. They are simply strings passed along from caller 
to callee. Enabling this misrepresentation would seem 
to enable serious security vulnerabilities, but there is no 
incentive for a caller to lie, since the addition of any an- 
tecedent principals strictly reduces the privileges of the 
caller. Of course, there will be circumstances when a 
caller wants to take an action that will result in increased 
privileges for a downstream callee. Toward that end, 
Quire provides a mechanism for verifiable statements 
(see Section 2.3). 

In our design, we require the callee to learn the caller’s 
identity in an unforgeable fashion. The callee then 
prepends the “Caller says” tokens to the statement it 
hears from the caller, using information that is available 
as part of every Android Binder IPC. Any lack of privileges
on the caller’s part will be properly reflected when
the privileges for the trusted operation are later evaluated.

Furthermore, our design is lightweight; we can con- 
struct and propagate IPC call chains with little impact on 
IPC performance (see Section 5). 


2.3 Verifiable statements 


Stack inspection semantics are helpful, but are not suf- 
ficient for many security needs. We envision a variety 
of scenarios where we will need semantics equivalent to 
digital signatures, but with much better performance than 
public-key cryptographic operations. 


Definition. A verifiable statement is a 3-tuple
[P, M, A(M)_P] where P is the principal that said message
M, and A(M)_P is an authentication token that can be
used by the Authority Manager OS service to verify P
said M. In ABLP, this tuple represents the statement “P
says M.”

In order to operate without requiring slow public-key 
cryptographic operations, we have two main choices. We 
could adopt some sort of central registry of statements, 
perhaps managed inside the kernel. This would require a 
context switch every time a new statement is made, and 
it would also require the kernel to store these statements 
in a cache with some sort of timeout strategy to avoid a 
memory use explosion. 

The alternative is to adopt a symmetric-key cryp- 
tographic mechanism, such as message authentication 
codes (MAC). MAC functions, like HMAC-SHA1, run 
several orders of magnitude faster than digital signature 
functions like DSA, but MAC functions require a shared 
key between the generator and verifier of a MAC. To
avoid an N² key explosion, we must have every application
share a key with a central, trusted authority manager.
As such, any app can produce a statement “App
says M”, purely by computing a MAC with its secret 
key. However, for a second app to verify it, it must send 
the statement to the authority manager. If the authority 
manager says the MAC is valid, then the second app will 
believe the veracity of the statement. 

There are two benefits of the MAC design over the 
kernel statement registry. First, it requires no context 
switches when statements are generated. Context switch- 
ing is only necessary when a statement is verified, which 
we expect to happen far less often. Second, the MAC 
design requires no kernel-level caching strategy. Instead, 
signed statements are just another element in the mar- 
shaled data being passed via IPC. The memory used for 
them will be reclaimed whenever the rest of the message 
buffer is reclaimed. Consequently, there is no risk that 
an older MAC statement will become unverifiable due to 
cache eviction. 
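
As a minimal sketch of the MAC-based design, assuming HMAC-SHA1 as named above, the following shows an app producing the tuple [P, M, A(M)_P] locally and a stand-in authority manager verifying it. The class and function names are hypothetical, and an in-process dictionary replaces what would really be a separate trusted service.

```python
import hashlib
import hmac
import os


class AuthorityManager:
    """Stand-in for the trusted service that shares one key with each app."""

    def __init__(self):
        self._keys = {}

    def issue_key(self, principal):
        # A fresh random key per principal; 20 bytes matches the SHA-1 output size.
        key = os.urandom(20)
        self._keys[principal] = key
        return key

    def verify(self, statement):
        """Return True if the token matches the message for the named principal."""
        principal, message, token = statement
        key = self._keys.get(principal)
        if key is None:
            return False
        expected = hmac.new(key, message, hashlib.sha1).digest()
        return hmac.compare_digest(expected, token)


def make_statement(principal, key, message):
    """Build the tuple (P, M, A(M)_P) without contacting the authority manager."""
    token = hmac.new(key, message, hashlib.sha1).digest()
    return (principal, message, token)


manager = AuthorityManager()
key_a = manager.issue_key("app_a")
statement = make_statement("app_a", key_a, b"purchase order 17")
is_valid = manager.verify(statement)
```

Note that only `verify` involves the authority manager; `make_statement` runs entirely inside the app, which mirrors the point that statement generation needs no context switch.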


2.4 RPC attestations 


When moving from on-device IPCs to Internet RPCs, 
some of the properties that we rely on to secure on-device 
communication disappear. Most notably, the receiver of 
a call can no longer open a channel to talk to the authority
manager, even if they did trust it.¹ To combat this,
Quire’s design requires an additional “network provider” 
system service, which can speak over the network, on be- 
half of statements made on the phone. This will require it 
to speak with a cryptographic secret that is not available 
to any applications on the system. 

One method for getting such a secret key is to have 
the phone manufacturer embed a signed X.509 certifi- 
cate, along with the corresponding private key, in trusted 
storage which is only accessible to the OS kernel. This 
certificate can be used to establish a client-authenticated 
TLS connection to a remote service, with the remote 
server using the presence of the client certificate, as
endorsed by a trusted certification authority, to provide
confidence that it is really communicating with the Quire
phone’s operating system, rather than an application at- 
tempting to impersonate the OS. With this attestation- 
carrying encrypted channel in place, RPCs can then carry 
a serialized form of the same statements passed along in 
Quire IPCs, including both call chains and signed state- 
ments, with the network provider trusted to speak on be- 
half of the activity inside the phone. 

All of this can be transmitted in a variety of ways,
such as a new HTTP header. Regular Quire applications
would be able to speak through this channel, but the new
HTTP headers, with their security-relevant contextual
information, would not be accessible to or forgeable by
the applications making RPCs. (Quire RPCs are analogous
to the HTTP origin header [2], generated by modern web
browsers, but Quire RPCs carry the full call chain as well
as any MAC statements, giving significant additional
context to the RPC server.)

¹ Like it or not, with NATs, firewalls, and other such impediments
to bi-directional connectivity, we can only reliably assume that a phone
can make outbound TCP connections, not receive inbound ones.


The strength of this security context information is 
limited by the ability of the device and the OS to pro- 
tect the key material. If a malicious application can 
extract the private key, then it would be able to send 
messages with arbitrary claims about the provenance of 
the request. This leads us inevitably to techniques from 
the field of trusted platform measurement (TPM), where 
stored cryptographic key material is rendered unavailable 
unless the kernel was properly validated when it booted. 
TPM chips are common in many of today’s laptops and 
could well be installed in future smartphones. 


Even without TPM hardware, Android phones gen- 
erally prohibit applications from running with full root 
privileges, allowing the kernel to protect its data from 
malicious apps. Of course, there may well always be se- 
curity vulnerabilities in trusted applications. These could 
be exploited by malicious apps to amplify their privi- 
leges; they’re also exploited by tools that allow users 
to “root” their phones, typically to work around carrier- 
instituted restrictions such as forbidding phones from 
freely relaying cellular data services as WiFi hotspots. 
Once a user has “rooted” an Android phone, apps can 
then request “super user” privileges, which if granted 
would allow the generation of arbitrary signed state- 
ments. 

While this is far from ideal, we note that Google and 
other Android vendors are already strongly incentivized 
to fix these security holes, and that most users will never 
go to the trouble of rooting their phones. Consequently, 
an RPC server can treat the additional context informa- 
tion provided by Quire as a useful signal for fraud pre- 
vention, but other server-side mechanisms (e.g., anomaly 
detection) will remain a valuable part of any overall de- 
sign. 


Privacy. An interesting concern arises with our design: 
Every RPC call made from Quire uses the unique pub- 
lic key assigned to that phone. Presumably, the public 
key certificate would contain a variety of identifying in- 
formation, thus making every RPC personally identify 
the owner of the phone. This may well be desirable 
in some circumstances, notably allowing web services 
with Android applications acting as frontends to com- 
pletely eliminate any need for username/password di- 
alogs. However, it’s clearly undesirable in other cases. 
To address this very issue, the Trusted Computing Group 
has designed what it calls “direct anonymous attestation”,²
using cryptographic group signatures to allow the caller to
prove that it knows one of a large group of related private
keys without saying anything about which
one [8]. This will make it impossible to correlate multi- 
ple connections from the same phone. A production im- 
plementation of Quire could certainly switch from TLS 
client-auth to some form of anonymous attestation with- 
out a significant performance impact. 

An interesting challenge, for future work, is being able 
to switch from anonymous attestation, in the default case, 
to classical client-authentication, in cases where it might 
be desirable. One notable challenge of this would be 
working around users who will click affirmatively on any 
“okay / cancel” dialog that’s presented to them without 
ever bothering to read it. Perhaps this could be finessed 
with an Android privilege that is requested at the time 
an application is installed. Unprivileged apps can only 
make anonymous attestations, while more trusted apps 
can make attestations that uniquely identify the specific 
user/phone. 


2.5 Drawbacks and circumvention


The design of Quire makes no attempt to prevent a mali- 
cious deputy from circumventing the security constructs 
introduced in Quire. For example, an attacker
could create two collaborating applications, one with in- 
ternet permission and one with GPS permission, to cir- 
cumvent Chinese Wall-style policies [5] that might re- 
quire that the GPS provider never deliver GPS informa- 
tion to an app with internet permission. Such malicious 
interactions can be detected and averted by systems like 
TaintDroid [13] and XManDroid [6]. We are primarily 
concerned with preventing benign applications from act- 
ing as confused deputies while still enabling apps to ex- 
ercise their full permission sets as intentional deputies 
when needed. 


3 Implementation 


Quire is implemented as a set of extensions to the existing
Android Java runtime libraries and Binder IPC system. The
authority manager and network provider are trusted components
and therefore implemented as OS-level services, while our
modified Android interface definition language code generator
provides IPC stub code that allows applications to propagate
and adopt an IPC call-stack. The result, which is implemented
in around 1300 lines of Java and C++ code, is an extension to
the existing Android OS that provides locally verifiable
statements, IPC provenance, and authenticated RPC for
Quire-aware applications and backward compatibility for
existing Android applications.

² http://www.zurich.ibm.com/security/daa/


3.1 On- and off-phone principals


The Android architecture sandboxes applications such 
that apps from different sources run as different Unix 
users. Standard Android features also allow us to resolve 
user-ids into human-readable names and permission sets, 
based on the applications’ origins. Based on these features,
the prototype Quire implementation defines principals as the
tuple of a user-id and process-id. We include
the process-id component to allow the recipient of an IPC 
method call to stipulate policies that force the process-id 
of a communication partner to remain unchanged across 
a series of calls. (This feature is largely ignored in the 
applications we have implemented for testing and evalu- 
ation purposes, but it might be useful later.) 

While principals defined by user-id/process-id tuples 
are sufficient for the identification of an application on 
the phone, they are meaningless to a remote service. 
However, the Android system requires all applications 
to be signed by their developers. The public key used 
for signing the application can be used as part of the 
identity of the application. Quire therefore resolves the 
user-id/process-id tuples used in IPC call-chains into an 
externally meaningful string consisting of the marshaled 
chain of application names and public keys when RPC 
communication is invoked to move data off the phone. 
This lazy resolution of IPC principals allows Quire to re- 
duce the memory footprint of statements when perform- 
ing IPC calls at the cost of extra effort when RPCs are 
performed. 


3.2 Authority management 


The Authority Manager discussed in Section 2 is imple- 
mented as a system service that runs within the operating 
system’s reserved user-id space. The interface exposed 
by the service allows userspace applications to request 
a shared secret, submit a statement for verification, or 
request the resolution of the principal included in a state- 
ment into an externally meaningful form. 

When an application requests a key, the Authority Manager
generates one and records it in a table mapping
user-id / process-id tuples to keys. It is important
to note that a subsequent request from the same applica- 
tion will prompt the Authority Manager to create a new 
key for the calling application and replace the previous 
stored key in the lookup table. This prevents attacks that 
might try to exploit the reuse of user-ids and process-ids 
as applications come and go over time. Needless to say, 
the Authority Manager is a system service that must be 
trusted and separated from other apps. 
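
The key-replacement behavior can be sketched as follows. This is an illustration under assumptions, not the Quire code: the class name `KeyTable` is hypothetical, and the user-id and process-id are passed as plain arguments here, whereas a real service would take them from the IPC layer rather than trusting the requester.

```python
import hashlib
import hmac
import os


class KeyTable:
    """Maps (uid, pid) tuples to the most recently issued MAC key."""

    def __init__(self):
        self._keys = {}

    def request_key(self, uid, pid):
        # Every request creates a new key and overwrites any previous entry,
        # so a process that inherits a recycled (uid, pid) pair cannot reuse
        # statements made by an earlier holder of that pair.
        key = os.urandom(20)
        self._keys[(uid, pid)] = key
        return key

    def verify(self, uid, pid, message, token):
        key = self._keys.get((uid, pid))
        if key is None:
            return False
        expected = hmac.new(key, message, hashlib.sha1).digest()
        return hmac.compare_digest(expected, token)


table = KeyTable()
old_key = table.request_key(10001, 4242)
old_token = hmac.new(old_key, b"hello", hashlib.sha1).digest()
valid_before = table.verify(10001, 4242, b"hello", old_token)

new_key = table.request_key(10001, 4242)  # same tuple asks again
valid_after = table.verify(10001, 4242, b"hello", old_token)
```

One consequence of this choice is that statements signed with the old key stop verifying as soon as a new key is requested, which is the intended defense against identifier reuse.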
3.3 Verifiable statements


Section 2.3 introduced the idea of attaching an OS veri- 
fiable statement to an object in order to allow principals 
later in a call-chain to verify the authenticity and integrity 
of a received object. 

Our implementation of this abstract concept involves 
a parcelable statement object that consists of a principal 
identifier as well as an authentication token. When this 
statement object is attached to a parcelable object, the an- 
notated object contains all the information necessary for 
the Authority Manager service to validate the authentica- 
tion token contained within the statement. Therefore the 
annotated object can be sent over Android’s IPC channels
and later delivered to the Quire Authority Manager
for verification by the OS.

Quire’s verifiable statement implementation establishes
the authenticity of a message with HMAC-SHA1,
which proved to be exceptionally efficient for our needs,
while still providing the authentication and integrity
semantics required by Quire.

Even with HMAC-SHA1, speed still matters. In practice,
doing HMAC-SHA1 in pure Java was still slow
enough to be an issue. We resolved this by using a native 
C implementation from OpenSSL and exposing it to Java 
code as a Dalvik VM intrinsic function, rather than a JNI 
native method. This eliminated unnecessary copying and 
runs at full native speed (see Section 5.2.1). 


3.4 Code generator 


The key to the stack inspection semantics that Quire
provides is an extension to the Android Interface Definition
Language (AIDL) code generator. This piece of software 
is responsible for taking in a generalized interface defini- 
tion and creating stub and proxy code to facilitate Binder 
IPC communication over the interface as defined in the 
AIDL file. 

The Quire code generator differs from the stock Android
code generator in that it adds directives to the
marshaling and unmarshaling phase of the stubs that pull
the call-chain context from the calling app and attach
it to the outgoing IPC message for the callee to retrieve.
These directives allow for the “quoting” semantics that 
form the basis of a stack inspection based policy system. 

Our prototype implementation of the Quire AIDL code generator
requires that an application developer specify that an AIDL
method become “Quire aware” by defining the method with a
reserved auth flag in the AIDL input file. This flag informs
the Quire code generator to produce additional proxy and stub
code for the given method that enables the propagation and
delivery of the call-chain context to the specified method. A
production implementation would pass this information
implicitly on all IPC calls.

In addition to enabling quoting semantics, the mod- 
ified code generator also exposes helper functions that 
wrap the generation (and storage) of a shared secret with 
the OS Authority Manager and the creation and trans- 
mission of a verifiable statement to a communicating IPC 
endpoint. 


4 Applications 


We built two different applications to demonstrate the 
benefits of Quire’s infrastructure. 


4.1 Click fraud prevention 


Current Android-based advertising systems, such as Ad- 
Mob, are deployed as a library that an app includes as 
part of its distribution. So far as the Android OS is
concerned, the app and its ads are operating within a single
domain, indistinguishable from one another. Furthermore,
because advertisement services need to report their ac- 
tivity to a network service, any ad-supported app must 
request network privileges, even if the app, by itself, 
doesn’t need them. 

From a security perspective, mashing these two dis- 
tinct security domains together into a single app creates 
a variety of problems. In addition to requiring network- 
access privileges, the lack of isolation between the adver- 
tisement code and its host creates all kinds of opportuni- 
ties for fraud. The hosting app might modify the adver- 
tisement library to generate fake clicks and real revenue. 

This sort of click fraud is also a serious issue on the 
web, and it’s typically addressed by placing the adver- 
tisements within an iframe, creating a separate protec- 
tion domain and providing some mutual protection. To 
achieve something similar with Quire, we needed to extend
Android’s UI layer and leverage Quire’s features to
authenticate indirect messages, such as UI events, dele- 
gated from the parent app to the child advertisement app. 


Design challenges. Fundamentally, our design re- 
quires two separate apps to be stacked (see Figure 2), 
with the primary application on top, and opening a trans- 
parent hole through which the subordinate advertising 
application can be seen by the user. This immediately 
raises two challenges. First, how can the advertising app 
know that it’s actually visible to the user, versus being 
obscured by the application? And second, how can the 
advertising app know that the clicks and other UI events 
it receives were legitimately generated by the user, versus 
being synthesized or replayed by the primary application?
Figure 2: The host and advertisement apps.


Stacking the apps. This was straightforward to implement.
The hosting application uses a translucent theme
(Theme.Translucent), making the background activity visible.
We modified the activity launch logic so that, when an
activity containing an advertisement is started or resumed,
the advertisement activity is placed below the associated
host activities. When a user event is delivered to the
AppFrame view, it sends the event along with the current
location of AppFrame in the window to an advertisement event
service. This allows our prototype to correctly display
the two apps together.


Visibility. Android allows an app to continue running, 
even when it’s not on the screen. Assuming our ad ser- 
vice is built around payments per click, rather than per 
view, we’re primarily interested in knowing, at the mo- 
ment that a click occurred, that the advertisement was 
actually visible. Android 2.3 added a new feature where 
motion events contain an “obscured” flag that tells us 
precisely the necessary information. The only challenge 
is knowing that the MotionEvent we received was legiti- 
mate and fresh. 


Verifying events. With our stacked app design, motion 
events are delivered to the host app, on top of the stack. 
The host app then recognizes when an event occurs in the 
advertisement’s region and passes the event along. To 
complicate matters, Android 2.3 reengineered the event 
system to lower the latency, a feature desired by game 
designers. Events are now transmitted through shared 
memory buffers, below the Java layer. 

Figure 3: Secure event delivery from host app to
advertisement app.


In our design, we leverage Quire’s signed statements.
We modified the event system to augment every MotionEvent
(as many as 60 per second) with one of our MAC-based
signatures. This means we don’t have to worry about
tampering or other corruption in the event system.
Instead, once an event arrives at the advertisement
app, it first validates the statement, then validates that
it’s not obscured, and finally validates the timestamp in
the event, to make sure the click is fresh. This process is
summarized in Figure 3.
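
The three checks performed by the advertisement app can be sketched as below. This is a simplified illustration, not the modified Android event system: the JSON encoding, the function names, and the 0.5 second freshness window are all assumptions, and the `verify` callable stands in for a request to the Authority Manager, since the advertisement app itself does not hold the signing key.

```python
import hashlib
import hmac
import json

MAX_EVENT_AGE_S = 0.5  # hypothetical freshness window, not from the paper


def encode_event(x, y, obscured, timestamp):
    """Serialize the event fields that the MAC should cover."""
    event = {"x": x, "y": y, "obscured": obscured, "timestamp": timestamp}
    return json.dumps(event, sort_keys=True).encode()


def make_os_signer(key):
    """Stand-in for the OS: returns (sign, verify) closures over a secret key."""
    def sign(payload):
        return hmac.new(key, payload, hashlib.sha1).digest()

    def verify(payload, token):
        return hmac.compare_digest(sign(payload), token)

    return sign, verify


def accept_click(verify, payload, token, now):
    """Apply the three checks in order: statement, visibility, freshness."""
    if not verify(payload, token):
        return False
    event = json.loads(payload)
    if event["obscured"]:
        return False
    age = now - event["timestamp"]
    return 0 <= age <= MAX_EVENT_AGE_S


sign, verify = make_os_signer(b"k" * 20)
payload = encode_event(120, 48, False, 100.0)
token = sign(payload)
accepted = accept_click(verify, payload, token, now=100.1)
```

Because the obscured flag and the timestamp are inside the MAC'd payload, a host app that flips either field invalidates the token and fails the first check.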

At this point, the local advertising application can now 
be satisfied that the click was legitimate and that the ad 
was visible when the click occurred, and it can communicate
that fact over the Internet, unspoofably, with Quire’s
RPC service. 

All said and done, we added around 500 lines of Java 
code for modifying the activity launch process, plus a 
modest amount of C code to generate the signatures. 
While our implementation does not deal with every pos- 
sible scenario (e.g., changes in orientation, killing of the 
advertisement app due to low memory, and other such 
things), it still demonstrates the feasibility of hosting
advertisements in separate processes and defeating click
fraud attacks. 


4.2 PayBuddy 


To demonstrate the usefulness of Quire for RPCs, we
implemented a micropayment application called Pay- 
Buddy: a standalone Android application which exposes 
an activity to other applications on the device to allow 
those applications to request payments. 

This is a scenario which requires a high degree of
cooperation between many parties, but at the same time
involves a high degree of mutual distrust. The user may
not trust the application not to steal his banking
information, while the application may not trust the user to
faithfully make the required payment. Similarly, the ap- 
plication may not trust that the PayBuddy application on 
the phone is legitimate, while the PayBuddy application 
may not trust that the user has been accurately notified of 
the proper amount to be charged. Finally, the service side 
of PayBuddy may not trust that the legitimate PayBuddy 
application is the application that is submitting the pay- 
ment request. We designed PayBuddy to consider all of 
these sources of distrust. 

To demonstrate how PayBuddy works, consider the 
example shown in Figure 4. Application ExampleApp 
wishes to allow the user to make an in-app purchase. 
To do this, ExampleApp creates and serializes a pur- 
chase order object and signs it with its MAC key ky. 
It then sends the signed object to the PayBuddy appli- 
cation, which can then prompt the user to confirm their 
intent to make the payment. After this, PayBuddy passes 
the purchase order along to the operating system’s Net- 
work Provider. At this point, the Network Provider can 
verify the signature on the purchase order, and also that 
the request came from the PayBuddy application. It then 
sends the request to the PayBuddy.com server over a 
client-authenticated HTTPS connection. The contents of 
ExampleApp’s purchase order are included in an HTTP 
header, as is the call chain (“ExampleApp, PayBuddy”).

At the end of this, PayBuddy.com knows the follow- 
ing: 


• The request came from a particular device with a
  given certificate.

• The purchase order originated from ExampleApp
  and was not tampered with by the PayBuddy application.

• The PayBuddy application approved the request
  (which means that the user gave their explicit consent
  to the purchase order).


At the end of this, if PayBuddy.com accepts the trans- 
action, it can take whatever action accompanies the suc- 
cessful payment (e.g., returning a transaction ID that 
ExampleApp might send to its home server in order to 
download a new level for a game). 


Security analysis. Our design has several curious 
properties. Most notably, the ExampleApp and the Pay- 
Buddy app are mutually distrusting of each other. 

The PayBuddy app doesn’t trust the payment request 
to be legitimate, so it can present an “okay/cancel” dialog 
to the user. In that dialog, it can include the cost as well 
as the ExampleApp name, which it received through the 
Quire call chain. Since ExampleApp is the direct caller,
its name cannot be forged. The PayBuddy app will only 
communicate with the PayBuddy.com server if the user 
approves the transaction. 

Similarly, ExampleApp has only a limited amount of 
trust in the PayBuddy app. ExampleApp signs its purchase
order and includes a unique order number of some sort,
so a compromised PayBuddy app cannot modify or replay
the message without detection. Because the OS’s net provider is trusted to
speak on behalf of both the ExampleApp and the Pay- 
Buddy app, the remote PayBuddy.com server gets am- 
ple context to understand what happened on the phone 
and deal with cases where a user later tries to repudiate a 
payment. 
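
The role of the signed purchase order and its unique order number can be sketched as follows. This is a simplification under assumptions: in the design described above the Network Provider checks the statement on the phone and the server relies on the client-authenticated channel, whereas here both steps are folded into one hypothetical `PaymentServer` class with a `verify_statement` callable standing in for the on-phone check. The field names `order_id` and `amount_cents` are also invented for illustration.

```python
import hashlib
import hmac
import json


def sign_order(key, order):
    """ExampleApp side: serialize the order and MAC it with the app's key."""
    body = json.dumps(order, sort_keys=True).encode()
    return body, hmac.new(key, body, hashlib.sha1).digest()


class PaymentServer:
    """Accepts an order once, and only with the expected call chain."""

    EXPECTED_CHAIN = ["ExampleApp", "PayBuddy"]

    def __init__(self, verify_statement):
        self._verify_statement = verify_statement
        self._seen_order_ids = set()

    def submit(self, call_chain, body, token):
        if list(call_chain) != self.EXPECTED_CHAIN:
            return False  # request did not pass through PayBuddy as expected
        if not self._verify_statement(body, token):
            return False  # order was modified after ExampleApp signed it
        order_id = json.loads(body)["order_id"]
        if order_id in self._seen_order_ids:
            return False  # replay of an order that was already processed
        self._seen_order_ids.add(order_id)
        return True


example_app_key = b"k" * 20


def verify(body, token):
    expected = hmac.new(example_app_key, body, hashlib.sha1).digest()
    return hmac.compare_digest(expected, token)


body, token = sign_order(example_app_key, {"order_id": "A-1", "amount_cents": 99})
server = PaymentServer(verify)
first_attempt = server.submit(["ExampleApp", "PayBuddy"], body, token)
replayed = server.submit(["ExampleApp", "PayBuddy"], body, token)
```

The MAC covers the order number, so a compromised intermediary cannot swap in a fresh number to get past the replay check without invalidating the token.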

Lastly, the user’s PayBuddy credentials are never vis- 
ible to ExampleApp in any way. Once the PayBuddy 
app is bound, at install time, to the user’s matching ac- 
count on PayBuddy.com, there will be no subsequent 
username/password dialogs. All the user will see is an 
okay/cancel dialog. This will reduce the number of user- 
name/password dialogs that the user sees in normal us- 
age, which will make entering username and password 
an exceptional situation. Once users are accustomed to 
this, they may be more likely to react with skepticism 
when presented with a phishing attack that demands their 
PayBuddy credentials. (A phishing attack that’s com- 
pletely faithful to the proper PayBuddy user interface 
would only present an okay/cancel dialog, which yields 
no useful information for the attacker.) 


Google’s in-app billing. After we implemented Pay- 
Buddy, Google released their own micropayment sys- 
tem. Their system leverages a private key shared be- 
tween Google and each application developer to enable 
the on-phone application to verify that confirmations are
coming from Google’s Market servers. However, unlike 
PayBuddy, the messages from the Market application to 
the server do not contain OS-signed statements from the 
requesting application and the Market app. If the Market 
app were tampered with by an attacker, this could allow for a
variety of compromises that Quire would defeat. 

Also, while Google’s in-app billing is built on
Google-specific infrastructure, like its Market app, Quire’s
design provides general-purpose infrastructure that can be
used by PayBuddy or any other app. 

One last difference: PayBuddy returns a transaction 
ID to the app which requested payment. The app must 
then make a new RPC to the payment server or to its 
own server to validate the transaction ID against the orig- 
inal request. Google returns a statement that is digitally 
signed by the Market server which can be verified by 
a public key that would be embedded within the app. 
Google’s approach avoids an additional network round 
trip, but they recommend code obfuscation and other 
measures to protect the app from external tampering.³


5 Performance evaluation 


5.1 Experimental methodology 


All of our experiments were performed on the standard 
Android developer phone, the Nexus One, which has a 
1GHz ARM core (a Qualcomm QSD 8250), 512MB of 
RAM, and 512MB of internal Flash storage. We con- 
ducted our experiments with the phone displaying the 
home screen and running the normal set of applications 
that spawn at start up. We replaced the default “live wall- 
paper” with a static image to eliminate its background 
CPU load. 

All of our benchmarks are measured using the Android
Open Source Project’s (AOSP) Android 2.3 (“Gingerbread”)
as pulled from the AOSP repository on December 21st,
2010. Quire is implemented as a series
of patches to this code base. We used an unmodified 
Gingerbread build for “control” measurements and com- 
pared that to a build with our Quire features enabled for 
“experimental” measurements. 


5.2 Microbenchmarks


5.2.1 Signed statements


Our first microbenchmark of Quire measures the cost of
creating and verifying statements of varying sizes. To do
this, we had an application generate random byte arrays
of varying sizes from 10 bytes to 8000 bytes and measured
the time to create 1000 signatures of the data, followed
by 1000 verifications of the signature. Each set of measured
signatures and verifications was preceded by a priming run
to remove any first-run effects. We then took an average of
the middle 8 out of 10 such runs for each size. The large
number of runs is due to variance introduced by garbage
collection within the Authority Manager. Even with this
large number of runs, we could not fully account for this,
leading to some jitter in the measured performance of
statement verification.

Figure 5: Statement creation and verification time vs
payload size.

³ http://developer.android.com/guide/marketbilling/billing_best_practices.html
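
The "middle 8 out of 10" averaging used for these measurements can be written as a small helper. This is a sketch with a hypothetical function name and made-up timings; it assumes that "middle" means discarding the single fastest and single slowest run.

```python
def middle_mean(samples, drop=1):
    """Average the samples after discarding `drop` values from each end."""
    if len(samples) <= 2 * drop:
        raise ValueError("not enough samples to trim")
    ordered = sorted(samples)
    kept = ordered[drop:len(ordered) - drop]
    return sum(kept) / len(kept)


# Ten made-up run times (in microseconds) with one garbage-collection outlier.
runs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
trimmed = middle_mean(runs)
```

Trimming in this way keeps a single outlier run, such as one interrupted by garbage collection, from dominating the reported average.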

The results in Figure 5 show that statement creation 
carries a minimal fixed overhead of 20 microseconds 
with an additional cost of 15 microseconds per kilobyte. 
Statement verification, on the other hand, has a much 
higher cost: 556 microseconds fixed and an additional 
96 microseconds per kilobyte. This larger cost is primar- 
ily due to the context switch and attendant copying over- 
head required to ask the Authority Manager to perform 
the verification. However, with statement verification be- 
ing a much less frequent occurrence than statement gen- 
eration, these performance numbers are well within our 
performance targets. 
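
The fixed and per-kilobyte figures above amount to a simple linear cost model, sketched here. The function names are hypothetical, and treating a kilobyte as 1000 bytes is an assumption, since the text does not say which convention was used.

```python
BYTES_PER_KB = 1000  # assumption; the text does not specify 1000 vs 1024


def creation_cost_us(payload_bytes):
    """Estimated statement creation time: 20 us fixed plus 15 us per kilobyte."""
    return 20 + 15 * payload_bytes / BYTES_PER_KB


def verification_cost_us(payload_bytes):
    """Estimated verification time: 556 us fixed plus 96 us per kilobyte."""
    return 556 + 96 * payload_bytes / BYTES_PER_KB


create_2kb = creation_cost_us(2000)
verify_2kb = verification_cost_us(2000)
```

For a 2000-byte payload under these assumptions, the model gives 50 microseconds to create a statement and 748 microseconds to verify it, which makes the gap between the two operations concrete.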


5.2.2 IPC call-chain tracking 


Our next micro-benchmark measures the additional cost 
of tracking the call chain for an IPC that otherwise per- 
forms no computation. We implemented a service with 
a pair of methods, of which one uses the Quire IPC ex- 
tensions and one does not. These methods both allow us 
to pass a byte array of arbitrary size to them. We then 
measured the total round trip time needed to make each 
of these calls. These results are intended to demonstrate 
the slowdown introduced by the Quire IPC extensions in 
the worst case of a round trip null operation that takes no
action on the receiving end of the IPC method call.

Figure 6: Roundtrip single step IPC time vs payload size.
[Plot: time (us) against payload (bytes) for Quire and stock Android.]

Figure 7: Roundtrip IPC time vs call chain length.
[Plot: time (us) against call chain length for Quire, stock Android, and their difference.]

We discarded performance timings for the first IPC 
call of each run to remove any noise that could have been 
caused by previous activity on the system. The results in 
Figure 6 were obtained by performing 10 runs of 100 tri- 
als each at each size point, with sizes ranging from 0 to 
6336 bytes in 64-byte increments. 

These results show that the overhead of tracking the 
call chain for one hop is around 70 microseconds, which 
is a 21% slowdown in the worst case of doing no-op calls.

We also measured the effect of adding more hops into 
the call chain. This was done by having a chain of identical
services implementing a service similar to “traceroute”.
The payload for each method call was a single
integer, representing the number of hops remaining. 

The results in Figure 7 show that the overhead of tracking
the call chain is under 100 microseconds per hop,
which is a 20-25% slowdown in the worst case of calls
which perform no additional work. Even for a call chain
of 10 applications, the overhead is just 1 millisecond,
a slowdown well below what would be noticed by a user.


5.2.3 RPC communication


Statement depth | Time (us)
1               | 770
2               | 1045
4               | 1912
8               | 4576

Table 1: IPC principal to RPC principal resolution time.


The next microbenchmark we performed was deter- 
mining the cost of converting from an IPC call-chain into 
a serialized form that is meaningful to a remote service.
This includes the IPC overhead in asking the system ser- 
vices to perform this conversion. 

We found that, even for very long statement chains (of 
8 distinct applications), the extra cost of this computation 
is a few milliseconds, which is insignificant compared to 
the other costs associated with setting up and maintain- 
ing a TLS network connection. From this, we conclude 
that Quire RPCs introduce no meaningful overhead beyond
the costs already present in conducting RPCs over
cryptographically secure connections. 
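
As a rough reading of Table 1, the sketch below computes the average cost of each additional statement between consecutive rows. The numbers are taken from the table; the helper name marginal_costs is ours.

```python
# (statement depth, resolution time in microseconds), from Table 1
TABLE_1 = [(1, 770), (2, 1045), (4, 1912), (8, 4576)]


def marginal_costs(rows):
    """Average added time per extra statement between consecutive rows."""
    costs = []
    for (d0, t0), (d1, t1) in zip(rows, rows[1:]):
        costs.append((d0, d1, (t1 - t0) / (d1 - d0)))
    return costs


costs = marginal_costs(TABLE_1)
for low, high, per_statement_us in costs:
    print(f"depth {low} -> {high}: {per_statement_us:.1f} us per statement")
```

The per-statement cost rises with depth over the measured range, yet the total at depth 8 is still under 5 milliseconds.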


5.3 HTTPS RPC benchmark


To understand the impact of using Quire for calls to remote
servers, we performed some simple RPCs using
both Quire and a regular HTTPS connection. We called 
a simple echo service that returned a parameter that was 
provided to it. This allowed us to easily measure the effect
of payload size on latency. We ran these tests on
a small LAN consisting of a single wireless router with the
server plugged into it, using the phone’s WiFi antenna
for connectivity. Each data point is the mean of 10
runs of 100 trials each, with the highest and lowest times 
thrown out prior to taking the mean to remove anomalies. 
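
The averaging step can be sketched as a simple trimmed mean. The text does not say whether the highest and lowest times are dropped per run or across all trials of a data point; the sketch below assumes the latter, and the function names are hypothetical.

```python
def trimmed_mean(samples):
    """Mean of the samples after dropping one highest and one lowest value."""
    if len(samples) < 3:
        raise ValueError("need at least three samples to trim both ends")
    kept = sorted(samples)[1:-1]
    return sum(kept) / len(kept)


def data_point(runs):
    """Pool the trials of all runs (an assumption), then take the trimmed mean."""
    pooled = [t for run in runs for t in run]
    return trimmed_mean(pooled)


# Two toy runs of two trials each, in milliseconds; the outlier is dropped.
example = data_point([[1.0, 4.0], [6.0, 100.0]])
```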

The results in Figure 8 show that Quire adds an ad- 
ditional overhead which averages around 6 ms, with a 
maximum of 13.5 ms, and getting smaller as the payload 
size increases. This extra latency is small enough that it’s 
irrelevant in the face of the latencies experienced across 
typical cellular Internet connections. From this we can 
conclude that the overhead of Quire for network RPC is 
practically insignificant. 


5.4 Analysis 


Our micro-benchmarks demonstrate that adding call-chain
tracking can be done without a significant performance
penalty above and beyond that of performing standard
Android IPCs. Additionally, our RPC benchmarks
show that the addition of Quire does not cause a significant
slowdown relative to standard TLS-encrypted communications,
as the RPC latency is dominated by the relatively
slow speed of an Internet connection vs. on-device
communication.


Figure 8: Network RPC latency in milliseconds.

These micro-benchmarks, while useful for demonstrating
the small scale impact of Quire, do not provide
valuable context as to the impact Quire might have on the
Android user experience. However, our prototype advertisement
service requires each click on the system to be
annotated and signed, and its performance sheds light
on the full system impact of Quire. We tested the impact
of Quire on touch event throughput by using the
advertisement system discussed in Section 4 to sign and
verify every click flowing from the OS through a host
app to a simple advertisement app. We observed that
the touch event throughput (which is artificially capped at
60 events per second by the Android OS) remained unchanged
even when we chose to verify every touch event.
This is obviously not a standard use case (as it simulates
a user spamming 60 clicks per second on an advertisement);
however, even in this worst case scenario Quire
does not affect the user experience of the device.


6 Related work 


6.1 Smart phone platform security 


As mobile phone hardware and software increase in complexity,
the security of the code running on mobile devices
has become a major concern.

The Kirin system [14] and Security-by-Contract [12] 
focus on enforcing install time application permissions 
within the Android OS and .NET framework respec- 
tively. These approaches to mobile phone security allow 
a user to protect themselves by enforcing blanket restrictions
on what applications may be installed or what installed
applications may do, but do little to protect the
user from applications that collaborate to leak data or 
protect applications from one another. 

Saint [29] extends the functionality of the Kirin sys- 
tem to allow for runtime inspection of the full system 
permission state before launching a given application. 
Apex [28] presents another solution for the same prob- 
lem where the user is responsible for defining run-time 
constraints on top of the existing Android permission 
system. Both of these approaches allow users to specify 
static policies to shield themselves from malicious applications,
but don’t allow apps to make dynamic policy
decisions. 

CRePE [10] presents a solution that attempts to artifi- 
cially restrict an application’s permissions based on envi- 
ronmental constraints such as location, noise, and time- 
of-day. While CRePE considers contextual information 
to apply dynamic policy decisions, it does not attempt to 
address privilege escalation attacks. 


6.1.1 Privilege escalation 


XManDroid [6] presents a solution for privilege es- 
calation and collusion by restricting communication at 
runtime between applications where the communication 
could open a path leading to dangerous information flows 
based on Chinese Wall-style policies [5] (e.g., forbidding 
communication between an application with GPS privileges
and an application with Internet access). While this
does protect against some privilege escalation attacks,
and allows for enforcing a more flexible range of policies,
it has two drawbacks: applications may launch denial of
service attacks on other applications (e.g., connecting to
an application and thus preventing it from using its full
set of permissions), and an application cannot regain
privileges that it lost by communicating
with other applications.

In concurrent work to our own, Felt et al. present a
solution to what they term “permission re-delegation” attacks
against deputies on the Android system [15]. With
their “IPC inspection” system, apps that receive IPC requests
are poly-instantiated based on the privileges of
their callers, ensuring that the callee has no greater privileges
than the caller. IPC inspection addresses the same
confused deputy attack as Quire’s “security passing” IPC
annotations; however, the approaches differ in how intentional
deputies are handled. With IPC inspection, the
OS strictly ensures that callees have reduced privileges.
They have no mechanism for a callee to deliberately offer
a safe interface to an otherwise dangerous primitive.
Unlike Quire, however, IPC inspection doesn’t require
apps to be recompiled or any other modifications to be
made to how apps make IPC requests.
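
The privilege reduction described for IPC inspection can be pictured as a set intersection. The sketch below is only our illustration of "no greater privileges than the caller"; it is not code from either system, and the function name is hypothetical. The permission strings are used purely as example labels.

```python
def effective_permissions(callee_permissions, caller_permission_sets):
    """Reduce a callee's permissions to those shared with every caller."""
    effective = set(callee_permissions)
    for caller_permissions in caller_permission_sets:
        effective &= set(caller_permissions)
    return effective


# A deputy holding two permissions is called by an app holding only one.
deputy = {"INTERNET", "ACCESS_FINE_LOCATION"}
caller = {"INTERNET"}
reduced = effective_permissions(deputy, [caller])
```

Under this model the deputy cannot exercise its location permission on behalf of the caller, which also illustrates why an intentional deputy cannot deliberately expose a safe interface to that primitive.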


6.1.2 Dynamic taint analysis on Android 


The TaintDroid [13] and ParanoidAndroid [30] projects 
present dynamic taint analysis techniques for preventing
runtime attacks and data leakage. These projects attempt 
to tag objects with metadata in order to track information 
flow and enable policies based on the path that data has 
taken through the system. TaintDroid’s approach to in- 
formation flow control is to restrict the transmission of 
tainted data to a remote server by monitoring the out- 
bound network connections made from the device and 
disallowing tainted data to flow along the outbound chan- 
nels. The goal of Quire differs from that of taint analysis 
in that Quire is focused on providing provenance information
and preventing the access of sensitive data, rather
than in restricting where data may flow. 

The low level approaches used to tag data also differ 
between the projects. TaintDroid enforces its taint propa- 
gation semantics by instrumenting an application’s DEX 
bytecode to tag every variable, pointer, and IPC mes- 
sage that flows through the system with a taint value. In 
contrast, Quire’s approach requires only that the IPC subsystem
be modified, with no reliance on instrumented code.
Quire can therefore work with applications that use native
libraries, and it avoids the overhead imparted by
instrumenting code to propagate taint values.


6.2 Decentralized information flow control


A branch of the information flow control space focuses 
on how to provide taint tracking in the presence of mutu- 
ally distrusting applications and no centralized authority. 
Myers and Liskov’s work on decentralized information
flow control (DIFC) systems [25, 27] was the first at- 
tempt to solve this problem. Systems like DEFCon [23] 
and Asbestos [33] use DIFC mechanisms to dynamically 
apply security labels and track the taint of events mov- 
ing through a distributed system. These projects and 
Quire are similar in that they both rely on process iso- 
lation and communication via message passing channels 
that label data. However, DEFCon cannot provide its se- 
curity guarantees in the presence of deep copying of data 
while Quire can survive in an environment where deep 
copying is allowed since Quire defines policy based on 
the call chain and ignores the data contained within the 
messages forming the call chain. Asbestos avoids the 
deep copy problems of DEFCon by tagging data at the 
IPC level. While Asbestos and Quire use a similar ap- 
proach to data tagging, the tags are used for very dif- 
ferent purposes. Asbestos aims to prevent data leaks by 
enabling an application to tag its data and disallow a recipient
application from leaking information that it received
over an IPC channel, while Quire attempts to preemptively
disallow data from being leaked by protecting
the resource itself, rather than allowing the resource to 
be accessed then blocking leakage at the taint sink. 


6.3 Operating system security 


Communication in Quire is closely related to the mechanisms
used in Taos [38]. Both systems intend to provide
provenance to downstream callees in a communication
chain; however, Taos uses expensive digital signatures
to secure its communication channels while Quire
uses quoting and inexpensive MACs to accomplish the
same task. This notion of substituting inexpensive cryptographic
operations for expensive digital signatures was
also considered as an optimization in practical Byzantine
fault tolerance (PBFT) [7] for situations where network
latency is low and the additional message transmissions
are outweighed by the cost of expensive RSA signatures.
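
To make the MAC-based alternative concrete, the sketch below tags a statement with an HMAC and checks it using Python's standard hmac module. It is a generic illustration and not Quire's wire format: the statement text, the key handling, and the function names are all assumptions.

```python
import hashlib
import hmac
import os


def make_mac(key: bytes, statement: bytes) -> bytes:
    """Compute an HMAC-SHA256 tag over a statement."""
    return hmac.new(key, statement, hashlib.sha256).digest()


def verify_mac(key: bytes, statement: bytes, tag: bytes) -> bool:
    """Check a tag in constant time; only holders of the key can do this."""
    return hmac.compare_digest(make_mac(key, statement), tag)


# Hypothetical key shared between one app and a trusted verifier.
key = os.urandom(32)
statement = b"app A says: user clicked advertisement 42"
tag = make_mac(key, statement)
```

Unlike a digital signature, the tag can only be checked by a party that holds the same key, which is why a MAC-based design needs a trusted party to verify statements on behalf of others.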


6.4 Trusted platform management 


Our use of a central authority for the authentication 
of statements within Quire shares some similarities
with projects in the trusted platform management space. 
Terra [16] and vTPM [4] both use virtual machines as 
the mechanism for enabling trusted computing. The ar- 
chitecture of multiple segregated guest operating systems 
running on top of a virtual machine manager is similar to 
the Android design of multiple segregated users running 
on top of a common OS. However, these approaches both
focus on establishing the user’s trust in the environment 
rather than trust between applications running within the 
system. 


6.5 Web security 


Many of the problems of provenance and application 
separation addressed in Quire are directly related to 
the challenge of enforcing the same origin policy from 
within the web browser. Google’s Chrome browser [3, 
31] presents one solution where origin content is segre- 
gated into distinct processes. Microsoft’s Gazelle [36] 
project takes this idea a step further and builds up 
hardware-isolated protection domains in order to protect 
principals from one another. MashupOS [19] goes even 
further and builds OS level mechanisms for separating 
principals while still allowing for mashups. 

All of these approaches are more interested in protect- 
ing principals from each other than in building up the 
communication mechanism between principals. Quire
gets application separation for free by virtue of Android’s
process model, and focuses on expanding the capabilities
of the communication mechanism used between
applications on the phone and the outside world.


6.6 Remote procedure calls 


For an overview of some of the challenges and threats 
surrounding authenticated RPC, see Weigold et al. [37]. 
There are many other systems which would allow for se- 
cure remote procedure calls from mobile devices. Ker- 
beros [22] is one solution, but it involves placing too 
much trust in the ticket granting server (the phone man- 
ufacturers or network providers, in our case). Another 
possibility is OAuth [17], where services delegate rights to
one another, perhaps even within the phone. This seems 
unlikely to work in practice, although individual Quire 
applications could have OAuth relationships with exter- 
nal services and could provide services internally to other 
applications on the phone. 


7 Future work 


We see Quire as a platform for conducting a variety of 
interesting security research around smartphones. 


Usable and secure UI design. The IPC extensions 
Quire introduces to the Android operating system can 
be used as a building block in the design and imple- 
mentation of a secure user interface. We have already 
demonstrated how the system can efficiently sign every 
UI event, allowing for these events to be shared and delegated
safely. This existing application could be extended
to attest to the full state of the screen when a security crit- 
ical action, such as an OAuth accept/deny dialog, occurs 
and prevent UI spoofing attacks. 


Secure login. Any opportunity to eliminate the need 
for username/password dialogs from the experience of a 
smartphone user would appear to be a huge win, particu- 
larly because it’s much harder for phones to display tra- 
ditional trusted path signals, such as modifications to the 
chrome of a web browser. Instead, we can leverage the 
low-level client-authenticated RPC channels to achieve 
high-level single-sign-on goals. Our PayBuddy applica- 
tion demonstrated the possibility of building single-sign- 
on systems within Quire. Extending this to work with 
multiple CAs or to integrate with OpenID / OAuth ser- 
vices would seem to be a fruitful avenue to pursue. 


Web browsers. While Quire is targeted at the needs of 
smartphone applications, there is a clear relationship be- 
tween these and the needs of web applications in modern 
browsers. Extensions to Quire could have ramifications
on how code plugins (native code or otherwise) interact
with one another and with the rest of the Web. Extensions
to Quire could also form a substrate for building
a new generation of browsers with smaller trusted computing
bases, where the elements that compose a web
page are separated from one another. This contrasts with 
Chrome [31], where each web page runs as a monolithic 
entity. Our Quire work could lead to infrastructure sim- 
ilar, in some respects, to Gazelle [36], which separates 
the principals running in a given web page, but lacks our 
proposed provenance system or sharing mechanisms. 

An interesting challenge is to harmonize the differ- 
ences between web pages, which increasingly operate as 
applications with long-term state and the need for ad- 
ditional security privileges, and applications (on smart- 
phones or on desktop computers), where the principle 
of least privilege [32] is seemingly violated by running 
every application with the full privileges of the user, 
whether or not this is necessary or desirable. 


8 Conclusion 


In this paper we presented Quire, a set of extensions to
the Android operating system that enable applications to 
propagate call chain context to downstream callees and 
to authenticate the origin of data that they receive in- 
directly. These extensions allow applications to defend 
themselves against confused deputy attacks on their pub- 
lic interfaces and enable mutually untrusting apps to ver- 
ify the authenticity of incoming requests with the OS. 
When remote communication is needed, our RPC sub- 
system allows the operating system to embed attestations 
about message origins and the IPC call chain into the request.
This allows remote servers to make policy decisions
based on these attestations.

We implemented the Quire design as a backwards- 
compatible extension to the Android operating system 
that allows existing Android applications to co-exist with 
applications that make use of Quire’s services. 

We evaluated our implementation of the Quire design 
by measuring our modifications to Android’s Binder IPC 
system with a series of microbenchmarks. We also im- 
plemented two applications which use these extensions 
to provide click fraud prevention and in-app micropay- 
ments. 

We see Quire as a first step towards enabling more se- 
cure mobile operating systems and applications. With the 
Quire security primitives in place we can begin building 
a more secure UI system and improving login on mobile 
devices. 
Abstract


Mobile communication is an essential part of our 
daily lives. Therefore, it needs to be secure and reliable. 
In this paper, we study the security of feature phones, 
the most common type of mobile phone in the world. 
We built a framework to analyze the security of SMS 
clients of feature phones. The framework is based on 
a small GSM base station, which is readily available 
on the market. Through our analysis we discovered 
vulnerabilities in the feature phone platforms of all 
major manufacturers. Using these vulnerabilities we 
designed attacks against end-users as well as mobile 
operators. The threat is serious since the attacks can 
be used to prohibit communication on a large scale 
and can be carried out from anywhere in the world. 
Through further analysis we determined that such 
attacks are amplified by certain configurations of the 
mobile network. We conclude our research by providing 
a set of countermeasures. 


1 Introduction 


In recent years a lot of effort has been put into analyz- 
ing and attacking smartphones [18, 20, 24, 21, 22, 23, 
46, 45], neglecting the so-called feature phones. Feature 
phones, mobile phones that have advanced capabilities 
besides voice calling and text messaging, but are not con- 
sidered smartphones, make up the largest percentage of 
mobile devices currently deployed on mobile networks 
around the world. In comparison, smartphones only ac- 
count for about 16% of all mobile phones [43]. The lack 
of security research into the far more popular feature 
phones is explained by the fact that smartphones share 
much commonality with desktop computers and, therefore,
are easier to analyze. Researchers are able to use the
same or similar tools that they are already familiar with
on desktop computers. Feature phones, on the other hand,
are highly embedded systems that are closed to developers.
This results in billions (there are about 4.6 billion
mobile phone subscribers [43, 16]) of potentially vulner- 
able mobile devices out in the field, just waiting to be 
taken advantage of by a knowledgeable attacker. 

In this paper, we investigate the security of feature 
phones and the possibility for large scale attacks based on 
discovered vulnerabilities in these devices. We present a 
novel approach to the vulnerability analysis of feature 
phones, more specifically for their SMS client implementations.
SMS is interesting because it is a feature
that exists on every mobile phone. Furthermore, security
issues related to SMS messaging can be exploited from
almost anywhere in the world and thus present the ideal
attack vector against such devices. To the best of our 
knowledge, no attempt has been made before to analyze 
or test feature phones for security vulnerabilities. 

Analyzing feature phones is difficult for several rea- 
sons. First of all, feature phones are completely closed 
devices that do not allow for development of native appli- 
cations and do not provide debugging tools. Moreover, 
analyzing the part of the phone that interacts with the 
mobile phone network is hard since the mobile phone 
network between us and the target device is essentially 
a black box. As a consequence, analysis becomes time 
consuming, unreliable, and costly. 

We address these problems by building our own GSM 
network using equipment that can be bought on the mar- 
ket. We use this network not only for sending SMS mes- 
sages to the phones we analyze, but also as an advanced 
monitoring system. The monitoring system replaces our 
need for debuggers and other tools that are normally re- 
quired for thorough vulnerability analysis, but do not ex- 
ist for feature phones. 

Vulnerability analysis was conducted using fuzzing. 
We chose fuzzing as the testing technique because we 
did not have access to source code and reverse engineering
a large number of devices is not feasible. Additionally,
fuzzing proved to be very efficient since it allowed
us to analyze a large number of mobile handsets with the
same set of tests.

So far, we have found numerous vulnerabilities in fea- 
ture phones sold by the six market leading mobile phone 
manufacturers. The vulnerabilities are security critical 
as they can remotely crash and reboot the entire target 
phone. In the process the mobile phone is disconnected 
from the mobile network, interrupting any active calls 
and data connections. Such bugs and attacks have ex- 
isted before on the Internet, known as Ping-of-Death [6]. 
We believe this represents a serious threat to mobile telephony
worldwide.

To complete our research we further analyzed the 
effect of such attacks on the mobile phone core network. 
This resulted in two interesting findings. First, the 
mobile phone network can be abused to amplify our 
Denial-of-Service attacks. Second, by attacking mobile 
phones one can attack the mobile phone network itself. 


The main contributions of this paper are: 


• Vulnerability Analysis Framework for Feature
Phones: We introduce a novel method to conduct
vulnerability analysis of feature phones that is based
on a small GSM base transceiver station. We solve
the major issue of such analysis: the monitoring
for crashes and other unexpected behavior. We
present multiple solutions for monitoring such devices
while analyzing them. Our method furthermore
shows that once a system, such as GSM, becomes
partially open, the security of the entire system,
including the parts that are still closed, can be
analyzed and exploited.


• Bugs Present in Most Phones: We show that vulnerabilities
exist in most mobile phones that are deployed
on mobile networks around the world today.
The bugs we discovered can be abused for carrying
out large scale Denial-of-Service attacks.


• Attack Impact: We show that a small number of
bugs in the most popular mobile phone brands is
enough to take down a significant number of mobile
phones around the world. We further show that bugs
present in mobile phones can possibly be used to
attack the mobile phone network infrastructure.


The rest of this paper is structured in the following 
way. In Section 2 we discuss related work and show how 
our research extends previous work in this area. In Sec- 
tion 3 we explain how we selected our targets for analy- 
sis and resulting attacks. In Section 4 we show in great 
detail how to analyze feature phones for security vulnerabilities.
In Section 5 we lay out methods to use the vulnerabilities
discovered for large scale attacks on mobile
communication. In Section 6 we present methods for detecting
and preventing the attacks we designed. In Section 7
we briefly conclude.


2 Related Work 


Related work is separated into four parts. First, smart- 
phone vulnerability analysis. Second, mobile and feature 
phone bugs, which were all found purely by accident. 
Third, studies on attacks against mobile phone networks. 
Fourth, Denial-of-Service (DoS) attacks since we are go- 
ing to present a large scale mobile phone DoS attack in 
this paper. 

The authors of [24] built a framework for security 
analysis of Multimedia Messaging Service (MMS) im- 
plementations on Windows Mobile based smartphones. 
Similar research in [23] conducted vulnerability analy- 
sis of Short Message Service (SMS) implementations of 
smartphones. Both used traditional techniques such as 
debuggers and analysis of crash dumps to catch excep- 
tions generated during fuzzing. 

Our work presented in this paper is different, as we do 
not rely on debugging capabilities provided by the vari- 
ous manufacturers, which mostly do not provide such ca- 
pabilities at all. Instead we use a small GSM base station 
to monitor and catch abnormal behavior of the phones 
by monitoring and analyzing radio link activity. MMS- 
based attacks that lead to battery exhaustion due to in- 
creasing power consumption have been studied in [39]. 
They utilized the fact that MMS messages use more bat- 
tery resources because of GPRS and increased CPU us- 
age. However, we did not conduct this kind of analysis 
since our focus was software bugs in SMS implementa- 
tions. 

Over the last few years a small number of bugs have 
been discovered by individuals. Most of them have been 
found by accident. To our knowledge no systematic testing
has been conducted. Some examples follow. The bug named
Curse-of-Silence [44] in Symbian OS prevents
a phone from receiving any further SMS messages after
receiving the curse SMS message. The WAP-Push vCard bug on
Sony Ericsson phones [33] caused a target phone to
reboot. Some Nokia phones [34] contained a bug that
could be abused to remotely crash a phone by sending it 
a specially crafted vCard via SMS. Some mobile phones 
produced by Siemens contained a bug [17] that would 
shut down the phone when displaying an SMS message
that contained a special character. Bugs like these fuelled 
our research effort since we believed that most phones 
contain similar bugs. A large number of similar issues 
in an exploit arsenal can likely be used to carry out at- 
tacks against a bigger percentage of mobile phone users 
around the world. 

Enck et al. show in [47] that SMS messages sent over 
the Internet can be used to carry out a Denial-of-Service
attack against mobile phone networks. The attack fo- 
cused on blocking the mobile network’s control chan- 
nels, therefore, no more calls could be initiated. Solu- 
tions against this type of resource consumption attack 
are investigated in [37]. However our attacks, described 
in this paper, are not based on attacking the radio link 
(the control channel) in any way. We attack the hand- 
sets directly without targeting the control channel. A 
study on the capabilities of mobile phone botnets [36] 
shows that these could be used to carry out DoS attacks 
against a mobile network. The attack works by over- 
loading the Home Location Register (HLR) by trigger- 
ing large amounts of state changes by zombie phones. 
However, in this paper we show that one can achieve a 
similar kind of DoS attack against an operator’s network
by disconnecting large numbers of mobile phones from
the network. The difference to the botnet approach is that 
we do not need to have control over the zombie phones in 
the first place. We can remotely force them to reboot and 
disconnect and re-authenticate to the network and thus 
cause a higher load on the network core infrastructure. 

Denial-of-Service attacks such as the one presented in
this work have been studied in a wide range of areas, with
attacks ranging from the Web to DNS [38]. More interesting in our
context are attacks that disable real-world systems and 
processes such as emergency services [29] (although just 
as a side effect) or even postal services [40]. 

Essentially the work presented in this paper is differ- 
ent in many aspects. We focus on feature phones because 
feature phones are much more popular than smartphones. 
Therefore, attacks against feature phones have a larger 
global impact. In this work we present a security testing 
framework for analyzing SMS implementations of any 
kind of mobile phone. We used this framework to ana- 
lyze feature phones of the most popular manufacturers in 
the world, as shown in Section 3. We also performed this 
type of analysis because it has not been done in the past, 
even though these devices are widely deployed. 


3 Target Selection 


To achieve maximum impact with an attack, it makes 
sense to target the most popular devices. We determined
that feature phones are the dominant type of mobile
phones. They account for 83% of the U.S. mobile
market [10]; in comparison, smartphones make up just
16% of all mobile phones worldwide [43]. We acknowledge
that today smartphone sales are rising very fast, but
feature phones still dominate when it comes to deployed 
devices in the field. 

Most of the definitions of the term feature phone are 
a bit fuzzy. A loose definition of the term is: every mobile
phone that is neither a dumb phone nor a smartphone
is considered a feature phone. Dumb phones are phones
with minimal functionality; often they only support voice
calls and SMS messaging. Feature phones have less
functionality than smartphones
but still more than dumb phones. Feature phones
have proprietary operating systems (firmware) and have 
additional features (thus the term feature) such as play- 
ing music, surfing the web, and running simple applica- 
tions (mostly J2ME [41]). Despite this lack of function- 
ality (compared to smartphones) they are quite popular 
because they are cheap and offer long battery life. 

Technically interesting is the fact that feature phones 
are based on a single processor that implements the base- 
band, the applications, and user interface. Smartphones 
usually have a dedicated processor for the baseband. The 
consequence of this is that a simple bug on a feature 
phone may bring down the complete system. 

Mobile phones are produced by many different manufacturers
that all have their own OS; therefore, targeting
a single one of them will not result in a global effect. Since
we cannot simply target all mobile phone platforms, we
have to select the few that have enough market share
to be of global relevance.

To determine the major mobile phone manufacturers 
we analyzed various market reports covering worldwide [42]
and European [31] market share, as well as market shares in the
United States [28] and in Germany [27]. In the Appendix
of this paper we include a table containing the raw num- 
bers we gathered from the various market reports. 

Through this analysis we got a clear picture about the 
top manufacturers. These are Nokia, Samsung, LG, 
Sony Ericsson, and Motorola. We further chose 
to add Micromax [4] to the list of interesting mobile 
phone manufacturers because we read [9] that they are 
the third most popular brand of mobile phones in India. 


4 Security Analysis of Feature Phones 


Analyzing feature phones for security vulnerabilities is 
hard for several reasons. There is no access to source 
code of the OS and applications. There are no exist- 
ing native-SDKs, therefore, there is no way to run native 
code on the device and further no access to a debugger. 
JTAG-based debugging is also not an option since not all devices
have JTAG enabled. Furthermore, deeper knowledge
of the hardware and software is required in order to
use JTAG debugging in a meaningful way. 

For these reasons we chose to conduct fuzz-based
testing. The testing is carried out on our own GSM
network. In order to monitor for misbehavior and crashes,
and to find the related bugs, we designed our own
monitoring system. In this section we first describe the
setup of our GSM network, followed by the way we send
SMS messages in this setup. Then we describe our novel
monitoring setup. The final part of the section discusses
test cases and the resulting bugs that were discovered
throughout this work.


4.1 Network Setup 


Since we want to send large amounts of SMS messages 
we decided to build our own GSM network rather than 
sending SMS messages over a real network. On the one 
hand this has the advantage of not costing any money 
and on the other hand we do not risk interfering with the
telecommunication networks. We want to avoid crash- 
ing the operator’s network equipment by either content 
or quantity of SMS messages. Having our own network 
assures reproducible results because we have control of 
the entire system and are able to quickly find parameters 
that cause unexpected results. Analysis over a real oper- 
ator network would only leave us with the possibility of 
guessing in many cases. In addition, the delivery of SMS 
messages is much faster on our small network compared 
to a production setup of a mobile operator. 

On the hardware side we decided to use an ip.access 
nanoBTS [32], which is a small, fairly cheap (about 3500 
Euro) GSM Base Transceiver Station (BTS) that pro- 
vides an A-bis over IP interface. The A-bis interface 
is used to communicate between the BTS and the Base 
Station Controller (BSC). The BSC part of our setup is 
driven by OpenBSC [30]. OpenBSC is a Free Software 
implementation of the A-bis protocol that implements 
a minimal version of the BSC, Mobile Switching Cen- 
ter (MSC), Home Location Register (HLR), Authenti- 
cation Center (AuC) and Short Message Service Center 
(SMSC) components of a GSM network. Figure 1 shows
a picture of our setup. 





Figure 1: Our setup: A laptop that runs OpenBSC and 
the fuzzing tools, the nanoBTS, and some of the phones 
we analyzed. 


As GSM operates on a licensed frequency spectrum,
we had to carry out our experiments in a Faraday cage.
Utilizing this setup we are able to send SMS messages
to a mobile phone. OpenBSC either lets us send a text
message from its telnet interface to a subscriber of our
choice, or it processes an SMS message that it received
Over-the-Air in a store-and-forward fashion. As we will
see later, the existing interface is not suitable
for fuzzing since we need the ability to closely control all
parameters in the encoded SMS format as well as a way
to inject binary payloads.

Using a mobile phone to inject SMS messages into the
network is not an option as this would be very slow, as we
show later. Instead we built a software framework based
on a modified version of OpenBSC with the following features:

• Inject pre-encoded SMS into the phone network
• Extensive logging of fuzzing related feedback from
  the phone
• Logging of non-feedback events, i.e. a crash resulting
  in losing connection to the network
• Automatic detection of SMS that caused a certain
  event
• Process malformed SMS with OpenBSC
• Smart fuzzing of various SMS features
• Ability to fuzz multiple phones at once
• Sending SMS at higher rate than on a real network


The format of an SMS [15] differs depending 
on whether the message is Mobile Originated 
(MO) or Mobile Terminated (MT). This is 
mapped to the two formats SMS_SUBMIT (MO) and 
SMS_DELIVER (MT). In a typical GSM network, shown
in Figure 4, an SMS message that is sent from a mo- 
bile device is transferred Over-the-Air to the BTS of an 
operator in SMS_SUBMIT format. Every BTS is han- 
dled by a Base Station Controller (BSC) that is inter- 
acting with a Mobile Switching Center (MSC), which 
acts as the central entity handling traffic within the net- 
work. The MSC relays the SMS message to the respon- 
sible Short Message Service Center (SMSC), which is 
usually a combination of software and hardware that for- 
wards and relays messages to the destination phone or 
other SMSCs (in case of inter-operator messages or an 
operator with multiple SMSCs). In our setup OpenBSC 
acts as BSC, MSC, and SMSC. During the final transmission
to the destination the SMS is converted
to SMS_DELIVER; this is taken care of by OpenBSC.
Both formats are similar and no field that is subject to 
our fuzzing is lost. SMS_SUBMIT only contains the 
destination number and since SMS works in a store-and- 
forward fashion, the destination address is replaced with 
the sender number on the final transmission to the desti- 
nation. SMS_DELIVER does not include the destination 
number but instead relies on an existing channel to the
phone (after the phone has been paged). For this rea-
son we utilize the SMS_SUBMIT format when injecting
messages.


4.2 Sending SMS Messages 


OpenBSC itself does not provide an interface to submit 
pre-encoded SMS messages to the network, but only an 
interface to submit text SMS messages that are then con- 
verted into the corresponding encoding. We added a new 
interface to OpenBSC that allows us to submit SMS mes- 
sages directly in SMS_SUBMIT format. These messages 
are inserted into a database that is used by OpenBSC 
as part of the SMSC functionality. In our version not 
only the parsed SMS values are stored, but also the com- 
plete encoded message for easy reproducibility. Modi- 
fying the existing text message interface to be capable 
of handling binary encoded SMS messages proved to be 
infeasible. Messages submitted over this interface are 
instantly transmitted to the subscriber if he is attached to 
the network. This means opening a channel, initiating a 
data connection, sending the message and tearing down 
the connection. This works, but is very slow and takes 
about seven seconds per message. This is also the reason 
why we did not want to use a mobile phone to send our 
fuzz-messages in the first place. Our method of inject- 
ing messages is much faster. Prior to testing we use our 
new interface to inject thousands of messages into the 
SMSC database. Next, we send these messages. Ideally, 
this only opens a channel once and sends all SMS mes- 
sages (pending delivery) to the recipient and then closes 
the connection. This greatly improves the speed at which 
we can fuzz since the actual message transfer only takes 
about one second. 

In essence we removed the sending mobile phone and
replaced it with a direct interface to the network. This way
it was not necessary to modify the target mobile phone in 
any way. 


4.3 Monitoring for Crashes 


In fuzz-based testing, monitoring is one of the essential 
parts. Without good monitoring one will not catch any 
bugs. 

OpenBSC itself already has an error handler that takes
care of errors reported from the phone, which we modified
to fit our fuzzing case. The default error handler
does not differentiate between errors and does not take the
cause of an error into account. It simply stops the SMS
sending process in case of an error. The only exception
is a Memory Exceeded error, which causes OpenBSC
to dispatch a signal handler to wait for an SMMA signal
(released short message memory) indicating that there is
enough space again.

The mobile phone as well as the MSC are usually divided
into separate layers for transferring and processing
a message. As shown in Figure 2 they consist of
a Short Message Transport Layer (SM-TL), Short Mes- 
sage Relay Layer (SM-RL) and the Connection Sublayer 
(CM-Sub). The SM-TL [13] receives and relays mes- 
sages that it receives from the application layer in TPDU 
form (Transport Protocol Data Unit). This is the original 
encoding form that we describe later in this paper. The 
message is passed to the SM-RL to transport the TPDU 
to the mobile station. At this point the TPDU is encap- 
sulated as an RPDU. As soon as a connection is estab- 
lished between the mobile station and the network the 
RPDU is transferred Over-the-Air encapsulated in a CP- 
DATA unit that is part of Short Message Control Protocol 
(SM-CP). Both sides communicate via their CM-Subs 
with each other. The CM-Sub on the phone side will 
unpack the CPDU and forward the encapsulated TPDU 
to the Transport Layer using an RP-DATA unit. At this 
point the mobile phone stack has already performed san- 
ity checks on the content of the SMS and parsed it. The 
resulting reply, passed to CM-Sub, will include an ac- 
knowledgement of the SMS message and it will then be 
passed to the higher layers. From there it will end up 
in the user interface or an error message is encapsulated 
and sent back to the network. For our monitoring we 
need to log these replies carefully to observe the status 
of the phone. 


[Figure: message sequence diagram; recoverable labels: Mobile phone, SMS_DELIVER, Connection establishment, SM-RL-DATA-Req, RP-DATA, CP-DATA]

Figure 2: Mobile terminated SMS


From the wide variety of error messages a phone can 
reply to a received SMS message (defined in [14]), we 
observed during our fuzzing experiments that all of the 
tested phones either reply with a Protocol Error 
or Invalid Mandatory Information message 
in the case of a malformed message. These two responses,
besides the memory error, have been the only errors
that we observed in practice. We added code to flag
such an SMS message as invalid in the database and
continue delivering the next SMS that has not been flagged
as invalid. OpenBSC would otherwise continue trying to
retransmit the malformed SMS message and thus block
further delivery for the specific recipient. 
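
The modified error handling can be pictured roughly as follows. This is a minimal Python sketch rather than OpenBSC's actual C code; the `QueuedSms` record, the `send` callback, and the reply labels are hypothetical stand-ins for the database rows and the cause codes returned by the phone.

```python
from dataclasses import dataclass
from typing import Callable, Iterable

# Hypothetical reply labels; on the wire these are numeric cause codes.
ACK = "ack"
PROTOCOL_ERROR = "protocol_error"
INVALID_MANDATORY_INFO = "invalid_mandatory_information"
MEMORY_EXCEEDED = "memory_exceeded"

REJECTIONS = {PROTOCOL_ERROR, INVALID_MANDATORY_INFO}


@dataclass
class QueuedSms:
    sms_id: int
    pdu: bytes
    sent: bool = False
    invalid: bool = False


def deliver_batch(queue: Iterable[QueuedSms],
                  send: Callable[[bytes], str]) -> None:
    """Deliver queued messages, skipping over ones the phone rejects."""
    for sms in queue:
        if sms.sent or sms.invalid:
            continue
        reply = send(sms.pdu)
        if reply == ACK:
            sms.sent = True
        elif reply in REJECTIONS:
            # Flag the message and move on instead of retrying it forever.
            sms.invalid = True
        elif reply == MEMORY_EXCEEDED:
            # Phone storage is full: stop until it reports free memory.
            break
```

Flagging a rejected message instead of retrying it is what keeps a single malformed PDU from blocking the rest of the batch for that recipient.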

SMS messages are usually sent over a SDCCH (Stand- 
alone Dedicated Control Channel) or a SACCH (Slow 
Associated Control Channel). The details of such a chan-
nel are not important for the scope of this paper. However,
observing the use of such a logical channel is an important
means of detecting mobile phone crashes. Such a channel
will be established between the BTS and the phone on the 
start of an SMS delivery by paging the phone on a broad- 
cast channel. As we explained earlier, we only open the 
channel once and send a batch of messages using this 
one channel. The channel related signaling between the 
BSC and the BTS happens over the A-bis interface over 
highly standardized protocols. We added modifications 
to the A-bis Radio Signaling Link code of OpenBSC that 
allow us to check if a channel tear down happens in a
usual error condition, log when this happens and which 
phone was previously assigned to this channel. 

So while we lack possibilities to conduct traditional 
debugging methods on the device itself we can use the 
open part - OpenBSC - to do some debugging on the 
other end of the point-to-point connection. 

The difference to traditional debugging techniques is
that we are mostly limited to noticing an error condition
and monitoring the impact of such an error. We
are not able to peek at register values and other software
related details of the phone firmware. However,
it is enough to be able to reliably detect and reproduce
the error. Using this method it is also possible to find code
execution flaws. However, exploiting them and getting to
know the details about the specific behavior requires the
effort of reverse engineering the firmware for a specific
model. We try to avoid this effort in such a large scale test
of phones, but these bugs are a good base for further
investigations such as reverse engineering of firmware.

As the next step we wrote a script that parses the
log file, evaluates it, and takes actions in order to
determine which SMS message caused a problem.

When delivering an SMS message to a recipient phone,
under the assumption that it is associated with the cell,
three things can happen in practice. Either the message
is accepted and acknowledged, it is rejected with a reason
indicating the error, or an unexpected error occurs.
Such an unexpected error can be that the phone just
disconnected because it crashed or that, due to other reasons,
the received message is never acknowledged. In the latter
case, OpenBSC stores the SMS message in the database,
increases a delivery attempt counter and tries to retransmit
the SMS message when the phone associates with
the cell again. For our fuzzing results this means that
this method detects bugs in which the SMS message either
results in a phone crash after the phone accepted the message
or crashes the phone while it is still receiving the message,
in which case the message is never acknowledged and
OpenBSC continuously tries to deliver it.

Detecting the SMS message that caused such an error
condition is then fairly simple. Our script checks the error
condition and, if it occurred because of the loss of a
channel, it first queries the database for SMS messages
that have a delivery count greater than or equal
to one and are not marked as sent (meaning
they were not acknowledged). In this case we can say with
high probability that the found SMS message caused
the problem. If there is no such message, the script checks
which messages have been sent in a certain time inter- 
val around the time of the log event. During our testing 
we decided that a one minute time interval works well 
enough to have a fairly small subset of candidate SMS 
messages that could have caused a problem. Figure 3 
shows the logical view of our monitoring setup. 
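
The two-stage lookup described above can be sketched in a few lines of Python. This is an illustration only: the `SmsRecord` fields are hypothetical names for the database columns, and the one-minute interval is interpreted here as plus or minus 60 seconds around the log event, which is an assumption.

```python
from dataclasses import dataclass
from typing import List

WINDOW_SECONDS = 60  # assumed reading of the one-minute interval


@dataclass
class SmsRecord:
    sms_id: int
    delivery_attempts: int
    sent: bool           # True once the phone acknowledged the message
    last_attempt: float  # UNIX timestamp of the latest delivery attempt


def find_suspects(records: List[SmsRecord],
                  channel_lost_at: float) -> List[SmsRecord]:
    """Return the messages most likely responsible for a lost channel."""
    # First choice: messages that were attempted but never acknowledged.
    unacked = [r for r in records if r.delivery_attempts >= 1 and not r.sent]
    if unacked:
        return unacked
    # Fallback: anything delivered close to the time of the log event.
    return [r for r in records
            if abs(r.last_attempt - channel_lost_at) <= WINDOW_SECONDS]
```

The fallback deliberately returns a set of candidates rather than a single message, since a crash after acknowledgement cannot be attributed to one SMS from the database alone.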














[Figure: block diagram; recoverable labels: SMS Message Generator, inject, SMS Database, SMS Delivery Engine, feedback, nanoBTS, deliver, Target Phone, Fuzzing Framework, OpenBSC, log evaluation, Logging, Monitor, J2ME Echo Server]

Figure 3: Logical view of our setup.


4.4 Additional Monitoring Techniques 


In addition to the aforementioned OpenBSC setup we 
have developed more methods for monitoring for abnor- 
mal behavior. 

Bluetooth: Bluetooth can be used to check if a de- 
vice crashes or hangs. Our monitor script connects to the 
device using a Bluetooth virtual serial connection (RF- 
COMM) by connecting to the RFCOMM channel for 
the phone’s dial-up service. The script calls recv(2)
and blocks since the client normally is supposed to send 
data to the phone. When the phone crashes or hangs, the 
physical Bluetooth connection is interrupted and recv(2) 
returns, thus signaling us that something went wrong. 
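
The blocking-recv idea can be sketched as follows. This is a hedged illustration, not the monitor script itself: `wait_for_disconnect` is a hypothetical helper that works on any connected stream socket, and the RFCOMM address and channel shown in the comment are placeholders.

```python
import socket


def wait_for_disconnect(sock: socket.socket) -> str:
    """Block in recv() until the peer goes away; return a short reason."""
    while True:
        try:
            data = sock.recv(1024)
        except OSError as exc:
            # For example a reset or timeout when the radio link drops.
            return f"error: {exc}"
        if not data:
            return "closed"
        # The phone is not expected to send anything; ignore stray bytes.


# On Linux, a monitor could open the RFCOMM link roughly like this
# (address and channel are placeholders, not values from the paper):
#   sock = socket.socket(socket.AF_BLUETOOTH, socket.SOCK_STREAM,
#                        socket.BTPROTO_RFCOMM)
#   sock.connect(("00:11:22:33:44:55", 1))
#   print(wait_for_disconnect(sock))
```

Because the call only returns when the link breaks, the time at which it returns can be matched against the delivery log to narrow down the offending message.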

J2ME: Almost every modern feature phone supports
J2ME [41], which provides us with the only way
to do measurements on the phone since feature phones do not
run native applications. Applications running on the mobile
phone can register a handler in an SMS registry, similar
to binding an application to a TCP/UDP port. SMS
can make use of a User Data Header [13] (UDH) that
indicates that a certain SMS message is addressed to a
specific SMS port. When the phone receives a message,
this header field is parsed and the message is forwarded
to the application registered for this port. Our
J2ME application, which is installed on the fuzzed phone,
registers for a specific port and receives SMS messages
on it. For each chunk of fuzzed SMS messages we in- 
ject a valid message that is addressed to this port. The 
application then replies with an SMS message back to 
a special number that is not assigned to a phone. Fig- 
ure 3 shows this as the J2ME echo server. The message 
is just saved to the SMS database. This allows us to eas- 
ily lookup the count of SMS messages for this special 
number in the database and check if it increased or not. 
If not, it is very likely that some odd behavior was trig- 
gered. This kind of monitoring is useful to identify bugs 
that block the phone from processing received messages 
such as those described in [44]. 


4.5 SMS_SUBMIT Encoding 


The SMS_SUBMIT format as defined in [13] consists of 
a number of bit and byte fields, the destination address, 
and the message payload. Below we briefly describe the 
parts that are important for our analysis. We included a
diagram of the structure of an SMS_SUBMIT message 
in the Appendix. 

TP-Protocol-Identifier (1 octet) describes the type of 
messaging service being used. This refers to a
higher layer protocol or telematic interworking being 
used. While this is included in the specifications, we be- 
lieve that these interworkings are mostly legacy support 
and not in use these days. This makes it an interesting 
target to study unusual behavior. 

TP-Data-Coding-Scheme (1 octet) as described in [12] 
indicates the message class and the alphabet that is used 
to encode the TP-User-Data (the message payload). This 
can be the default 7 bit alphabet, the 8 bit or 16 bit alphabet,
or a reserved value.

The TP-User-Data field together with the TP- 
Protocol-Identifier and the TP-Data-Coding-Scheme are 
the main targets for fuzzing. The receiving phone parses 
and displays the message based on this information. 
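
To make the field layout concrete, the following Python sketch assembles a minimal SMS_SUBMIT TPDU without a validity period. It is not the library used in this work; the names `encode_address` and `build_sms_submit` and the example number are hypothetical. The TP-User-Data-Length is taken as the octet count unless the caller overrides it, which is only correct for 8 bit data (for 7 bit data the field counts septets).

```python
from typing import Optional


def encode_address(number: str) -> bytes:
    """Encode an international number as a TP-Destination-Address."""
    digits = number.lstrip("+")
    padded = digits + "F" * (len(digits) % 2)
    # Each octet holds two digits with the nibbles swapped.
    swapped = "".join(padded[i + 1] + padded[i]
                      for i in range(0, len(padded), 2))
    # Length counts digits; 0x91 = international number, ISDN plan.
    return bytes([len(digits), 0x91]) + bytes.fromhex(swapped)


def build_sms_submit(dest: str, pid: int, dcs: int, user_data: bytes,
                     udl: Optional[int] = None, udhi: bool = False,
                     message_ref: int = 0) -> bytes:
    """Assemble first octet, TP-MR, TP-DA, TP-PID, TP-DCS, TP-UDL, TP-UD."""
    first_octet = 0x01            # TP-MTI = SMS_SUBMIT, no validity period
    if udhi:
        first_octet |= 0x40       # TP-User-Data-Header-Indicator
    if udl is None:
        udl = len(user_data)      # octet count, valid for 8 bit data only
    return (bytes([first_octet, message_ref & 0xFF])
            + encode_address(dest)
            + bytes([pid & 0xFF, dcs & 0xFF, udl & 0xFF])
            + user_data)


pdu = build_sms_submit("+491511234567", pid=0x00, dcs=0x04, user_data=b"A")
print(pdu.hex())
```

Exposing `pid`, `dcs`, and `udl` as plain integers is a deliberate choice for a fuzzer: the builder does not validate them, so inconsistent or reserved values can be emitted on purpose.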

However, these fields are not enough to cover the com-
plete range of possible SMS features. If the TP-User-
Data-Header-Indicator bit (one of the earlier mentioned
bit fields) is set, TP-User-Data includes
a UDH.

The UDH is used to provide additional control information,
like headers in IP packets. It can hold multiple
so-called Information Elements [15] (IEI), for example
elements for port addressing, message concatenation, text
formatting and many more. IEIs are represented in a simple
type-length-value format. We included an example
UDH with multiple IEIs in the Appendix.


4.6 Fuzzing Test-cases


We have implemented a subset of the SMS specification 
as a Python library to create SMS PDUs (Protocol Data 
Unit) and used this to develop a variety of fuzzers. This 
includes fuzzers for vCard, vCalendar, Extended Mes- 
saging Service, multipart, SIM-Data-Download, WAP 
push service indication, flash SMS, MMS indication, 
UDH, simple text messages and various others fuzzing 
only single fields that are part of a specific SMS feature. 
Some of these features can also be combined. For exam- 
ple most of the features can either consist of single SMS 
message or be part of a multipart sequence by adding the 
corresponding multipart UDH. 

For the scope of this paper we focused on fuzzing mul- 
tipart, MMS indication (WAP push), simple text, flash 
SMS, and simple text messages with protocol ID/data 
coding scheme combinations. These test cases cover a 
wide variety of different SMS features. 

Multipart: SMS originally was designed to send up 
to 140 bytes of user data. With 7-bit encoding it is
possible to send up to 160 characters. However, various SMS
features rely on the possibility to send more data, e.g. 
binary encoded data. Multipart SMS allow this by split- 
ting payload across a number of SMS messages. This 
is achieved by using a multipart UDH chunk (IEI: 0, 
length: 3). This UDH chunk comprises three one byte 
values. The first byte encodes a reference number that 
should be random and the same in all message parts that 
belong to the same multipart sequence. Based on this 
value the phone is later able to reassemble the message. 
The second byte indicates the number of parts in the se- 
quence and the last byte specifies the current chunk ID. 
By fuzzing these three values we were mainly looking for 
abnormal behavior related to combinations of the current 
chunk ID and the number of chunks in a sequence, for
example missing chunks and chunk IDs higher than the
number of total chunks. 
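
As an illustration of the multipart header described above, the sketch below builds a UDH that carries a single concatenation element and enumerates a few inconsistent combinations. The helper names and the particular cases are hypothetical examples, not the test set used in this work.

```python
from typing import Iterator, Tuple


def concat_udh(ref: int, total: int, seq: int) -> bytes:
    """UDH holding one concatenation element (IEI 0x00, length 3)."""
    element = bytes([0x00, 0x03, ref & 0xFF, total & 0xFF, seq & 0xFF])
    # The leading octet is the length of the header that follows it.
    return bytes([len(element)]) + element


def suspicious_sequences(ref: int = 0x42) -> Iterator[Tuple[int, int, bytes]]:
    """Yield (total, seq, udh) for combinations a parser may mishandle."""
    cases = [
        (2, 3),      # chunk ID higher than the number of chunks
        (0, 1),      # sequence that claims to contain no parts
        (3, 0),      # chunk ID zero, although numbering starts at 1
        (255, 255),  # largest sequence, with only the last part sent
    ]
    for total, seq in cases:
        yield total, seq, concat_udh(ref, total, seq)


for total, seq, udh in suspicious_sequences():
    print(total, seq, udh.hex())
```

Such a header is placed at the start of TP-User-Data with the TP-User-Data-Header-Indicator bit set; sending only some parts of a sequence covers the missing-chunk case.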

MMS indication: When a subscriber receives 
an MMS (Multimedia Messaging Service) message an 
MMS notification indication message [48] is sent to him. 
This MMS indication is in fact a binary encoded WAP- 
push message sent via SMS. The notification contains 
multiple variable length fields for subject, transaction ID 
and sender name. There are no length fields for these 
values. They are simple zero terminated hex strings. An 
MMS indication message can also consist of multipart 
sequences. Therefore, our fuzzing targets were the variable
length field values included in the message, looking
for classic issues like buffer overflow vulnerabilities.

Simple text: Implementations of decoders for sim- 
ple 7 bit encoded SMS often work with a GSM alpha- 
bet represented for example with an array. The decoder 
first needs to unpack the 7 bit encoded values and convert
them to bytes. After this step it can look up the character
values in the GSM alphabet table. Our fuzzers mixed
valid 7 bit sequences with invalid encodings that would 
result in no corresponding array index. This could trigger 
all kinds of implementation bugs but most noteworthy 
out of bounds access resulting in null pointer exceptions 
and the like. 
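
The decoding step can be sketched as follows. The unpacking routine follows the standard septet packing; the lookup table is a deliberately truncated toy alphabet (an assumption made for illustration, the real GSM default alphabet has 128 entries) so that the bounds check a robust decoder needs becomes visible.

```python
from typing import List


def unpack_septets(data: bytes, count: int) -> List[int]:
    """Unpack `count` 7 bit values from octets packed LSB first."""
    septets: List[int] = []
    acc = 0
    bits = 0
    for octet in data:
        acc |= octet << bits
        bits += 8
        while bits >= 7 and len(septets) < count:
            septets.append(acc & 0x7F)
            acc >>= 7
            bits -= 7
    return septets


# Toy table covering only the codes 0x20 (space) to 0x7A ('z').
TOY_ALPHABET = [chr(c) for c in range(0x20, 0x7B)]
TABLE_OFFSET = 0x20


def decode(data: bytes, count: int) -> str:
    chars = []
    for septet in unpack_septets(data, count):
        index = septet - TABLE_OFFSET
        # Without this check an unexpected value indexes past the table.
        if 0 <= index < len(TOY_ALPHABET):
            chars.append(TOY_ALPHABET[index])
        else:
            chars.append("?")
    return "".join(chars)


print(decode(bytes.fromhex("e8329bfd06"), 5))
```

A decoder written without the range check is exactly the kind of code that mixed valid and invalid sequences are meant to exercise.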

TP-Protocol-Identifier/TP-Data-Coding-Scheme: 
The combination of both of these fields defines how 
the message is displayed and treated on the phone. 
Both of these fields are one byte values and also cover 
several rather unpopular features and reserved values. 
By fuzzing combinations of these values with random
lengths of user data payload we were aiming for odd 
behavior and bugs in code paths that are seldom used by 
normal SMS traffic. 
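
A generator for such test cases could look like the sketch below. It simply walks all 256 x 256 value pairs and attaches a random payload; the function name, the seed, and the 140 byte payload limit are assumptions for illustration.

```python
import itertools
import random
from typing import Iterator, Tuple


def pid_dcs_cases(seed: int = 0,
                  max_len: int = 140) -> Iterator[Tuple[int, int, bytes]]:
    """Yield (pid, dcs, payload) for every protocol ID / coding scheme pair."""
    rng = random.Random(seed)  # fixed seed keeps a test run reproducible
    for pid, dcs in itertools.product(range(256), repeat=2):
        length = rng.randint(0, max_len)
        payload = bytes(rng.randrange(256) for _ in range(length))
        yield pid, dcs, payload


for pid, dcs, payload in itertools.islice(pid_dcs_cases(), 3):
    print(hex(pid), hex(dcs), len(payload))
```

Seeding the random generator means a crash observed in the log can be reproduced by regenerating the same sequence of messages.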

Flash SMS: Flash messages are directly displayed on 
the phone without any user interaction and the user can 
optionally save the message to the phone memory. Our 
observations made it clear that often the code that ren- 
ders the flash SMS message on the display is not the 
same as the one that displays a normal message from 
the menu. Therefore, it can be prone to the same imple- 
mentation flaws as simple text messages. Additionally, 
flash SMS can consist of multipart chunks and there are 
several combinations of TP-Protocol-Identifier and TP- 
Data-Coding-Scheme that cause the phone to display the 
SMS as flash message. Our flash SMS fuzzers aim to 
cover a combination of all of the above possible imple- 
mentation weaknesses. 


4.7 Fuzzing Trial 


After each fuzzing-test-run we evaluate the log gener- 
ated by our monitoring script. All of the bugs described 
later in this paper were triggered by one or very few SMS 
messages and reproducing problems from log entries was 
rarely problematic. However, during our fuzzing stud- 
ies we stumbled across various forms of strange behav- 
ior. Problems we faced included non-standard conform- 
ing message replies and various kinds of weird behav- 
ior. Some phones were not properly reporting memory 
exhaustion. Others did not notice free memory until a re- 
boot. Some did not display a received SMS message on 
the user interface which made it hard to tell if the phone 
accepted a message or silently discarded it on the phone. 
Almost every phone we fuzzed needed a hard reset at
some point because it simply became unusable, whether for
unknown reasons, because of the mass of messages, or because
a specific SMS needed to be deleted from the SIM card using
another phone. One of the biggest issues we came across was
that very few manufacturers’ hard reset actually restored
the phone to an initial factory state. From what we know
this is done as a feature for customers in order to ensure
no personal data is lost. The behavior also differed between
phones of the same manufacturer. When testing a
bug on the Samsung B5310 it was always sufficient to remove
the offending SMS message from the phone’s SIM
card while the Samsung S5230 needed an additional hard
reset. Understanding such issues proved to be extremely
time-consuming. However, it is worth noting that purging
a phone of all personal information can prove to be
nearly impossible for a user. This can become an issue
whenever a user plans to sell a used handset to a third
party.


4.8 Results 


During our fuzz-testing we discovered quite a few bugs
that lead to security vulnerabilities. The bugs mostly
led to phones crashing and rebooting, which disconnected
the phones from the mobile network and interrupted
ongoing voice calls and data connections. Our
testing even resulted in two bricked phones that could 
no longer be reset and brought back into working order. 
We did not investigate the bricking in-depth because this 
would have gotten quite costly. Furthermore, some of 
the phones crash during the process of receiving the SMS 
message, and, therefore, fail to acknowledge the message 
thus causing re-transmission of the SMS message by the 
network. 

Below we present some of the bugs we discovered on 
each platform. In most cases we fuzzed only one phone 
from each platform and later only verified the bugs on 
other phones we had access to. This is expected because
most manufacturers base their entire product line on a
single software platform, only customizing options such
as the user interface depending on the hardware of a
specific device.

We reported all bugs to the manufacturers including 
full PDUs in order to verify and reproduce them. The 
feedback we received indicates that the bugs are present 
in most of their products based on their feature phone 
platforms. So far we have not received any information 
about fixes or updates. 

Nokia S40: On our test devices 6300, 6233, 
6131 NFC, 3110c we found a bug in the flash SMS 
implementation. The phones run different versions of the 
S40 operating system, the oldest of which was over 3 
years older than the newest. The manufacturer confirmed 
that this bug is present in almost all of their S40 phones. 
A certain flash SMS crashes the phone and
triggers the Nokia “white-screen-of-death”. This also results
in the phone disconnecting from and re-connecting to the
mobile phone network. Most notably, the SMS actually
never reaches the mobile phone. The phone will crash
before it can fully process and acknowledge the message.
On the one hand this has the side effect that the GSM network
performs a Denial-of-Service attack for free as it
continuously tries to transmit the message to the phone.
On the other hand this has a side effect on the phone since
there seems to be a watchdog in place that is monitoring
such crashes. This watchdog shuts down the phone after
3 to 5 crashes depending on the delay between the
crashes.

Sony Ericsson: Our test devices W800i, W810i, 
W890i, Aino running OSE have a problem similar 
to the Nokia phones. When combining certain payload 
lengths together with a specific protocol identifier value 
it is possible to knock the phone off the network. In 
this case there is no watchdog, but one SMS message is 
enough to force a reboot of the phone. As in the case of 
the Nokia bug, this SMS message will never be acknowledged
by the phone. To get an idea of how widespread
the problem is, we investigated the age of the devices and
found that the oldest phone (W800i) is from 2005 while 
the newest phone (Aino) is from late 2009. 

LG: Our LG GM360 seems to do insufficient bounds 
checking when parsing an MMS indication message. 
This allows us to construct an MMS indication SMS
message containing long strings that span three or
more SMS messages. This crashes the phone and thus forces an
unexpected reboot when receiving the message or
when trying to open the SMS message on the phone.

Motorola: As aforementioned, SMS supports telem- 
atic interworking with other network types. By send- 
ing one SMS message that specifies an Internet elec- 
tronic mail interworking combined with certain charac- 
ters in the payload it is possible to knock the phone off 
the mobile network. Upon receiving the message the 
phone shows a flashing white screen similar to the one 
shown by the Nokia phones. The phone does not com- 
pletely reboot; instead it simply restarts the user interface 
and reconnects to the network. This process takes a few 
seconds and depending on the payload it is possible to 
achieve this twice in a row with one message. We verified
this on the Razr, Rokr, and the SLVR L7, which are
older but extremely popular devices. The devices span
3+ years, providing us with confidence that the bug is
present in their entire platform.

Samsung: Multipart UDH chunks are commonly used 
for payloads that span over multiple SMS messages. The 
header chunk for multipart messages is simple. 

Our Samsung phones S5230 and B5310 do not properly
validate such multipart sequences. This allows us to
craft messages that show up as a very large SMS mes- 
sage on the phone. When opening such a message the 
phone tries to reassemble the message and crashes. De- 
pending on the exact model one to four SMS messages 
are needed to trigger the bug. 

Micromax: The Micromax X114 is prone to an issue
similar to that of the Samsung phones but behaves slightly
differently. When sending one SMS that contains a mul- 
tipart UDH with a higher chunk ID than the overall num- 
ber of chunks and a reference ID that has not been used 
yet, the phone receives the SMS message without in- 
stantly crashing. However a few seconds after the re- 
ceipt the display turns black for some seconds before the 
phone disconnects and reconnects to the network. 


4.9 Validation and Extended Testing 


After the initial fuzz-testing we needed to validate our 
results over a real operator network since we tested in a 
closed environment — our own GSM network. We need 
to evaluate if the bugs can be triggered in the real world 
or if operator restrictions prevent this. For the validation 
we put an active SIM card (of the four German operators) 
into our test phones and connected them to a real mobile 
phone network. We sent the SMS PDUs that triggered 
the bugs using the AT command interface of another mo- 
bile phone. These tests validated all the bugs described 
in the previous section. 

During our fuzzing tests we deactivated the security 
PIN on the SIM cards we used in the target phones so that 
we did not have to enter the PIN on every reboot. We also 
tested the phones with an enabled SIM PIN. Our goal was 
to determine if such reboots also reset the baseband and 
the SIM card. If the SIM card is blocked after reboot the 
phone is not reconnected to the GSM network, and, thus, 
the user is cut off permanently. We determined that this 
is true for our LG, Samsung, and Nokia devices. 


4.10 Bug Characterization 


We group the discovered bugs depending on the software 
layer they trigger. 

The first group are bugs that require user interaction 
such as the bug we discovered in the Samsung mobile 
phones. In this case the user has to view the message in 
order to trigger the bug. 

The second group are bugs that crash without user in- 
teraction. These bugs occur as soon as the phone has 
completed receiving the entire message and starts pro- 
cessing it. In this group we put the bugs we found on the 
Motorola, LG, and Micromax devices. 

The third and last group are bugs that trigger at a lower
layer of the software stack. By lower layer we mean
the process of receiving the SMS message from
the network. A crash during the transfer process means
that the transfer is not completed and the network believes
the message was not successfully delivered to the
phone. We categorize the bugs discovered in our Nokia
S40 and Sony Ericsson devices in this third group.


5 Implementing the Attack


The attacks presented in this work utilize SMS messages 
to trigger software bugs and crash mobile handsets, in- 
terrupting mobile communications. These bugs cover the 
mobile phone platforms of all major handset manufactur- 
ers and a wide variety of different models and firmware 
versions. The resulting bug arsenal can potentially be 
abused to carry out a large scale attack. 


5.1 Building a Hit-List 


To launch an attack, phone numbers of mobile phones
need to be acquired, since simply sending SMS messages
to every possible number is problematic. Sending SMS
messages to a large number of unconnected phone numbers
(dark address space) could trigger some kind of fraud
prevention system, similar to those used on the Internet
to detect worms [7]. In addition, for the described attack
only phone numbers that are connected to a mobile phone
are of interest. Depending on the kind of attack, a
different set of phone numbers is required. For example,
an attack might be targeted towards a specific mobile
operator, in which case only phone numbers that are
connected to that operator are of interest.


Regulatory Databases: In many countries around
the world mobile network operators have their own area
codes. Some examples are Germany¹, Italy², the United
Kingdom³, and Australia⁴. Such area codes can be readily
acquired to help build a hit-list. Likewise, one can
use the North American Numbering Plan (NANP) to determine
which area exchange codes are used by mobile
operators.


Web Scraping: Web Scraping is a technique to col- 
lect data from the World Wide Web through automated 
querying of search engines using scripted tools. Find- 
ing German mobile phone numbers can be easily done 
through queries like "+49151*" site:.de. More- 
over, online phonebooks [2] also include mobile phone 
numbers. These sites often allow wildcard searches and
thus can be abused to harvest mobile phone numbers.


HLR Queries: Some Bulk SMS operators [5] offer a
service to query the Home Location Register (HLR) for a
mobile phone number. These queries are very cheap (we
found one for only 0.006 Euro) and answer the question
of whether a mobile phone number exists and where it is
connected. Together with the information from the regulatory
databases one can easily generate a list of a few
thousand mobile phone numbers that belong to a specific
mobile network operator.


5.2 Sending SMS Messages


SMS messages can be sent by a mobile phone that pro- 
vides either an API that allows it to send arbitrary binary 
messages or through its AT command interface. We used 
the AT interface for most of our testing and validation. 
To carry out any kind of large scale attack, a way of
delivering large quantities of SMS messages at a low price
is needed. Multiple options exist to achieve this:

Bulk SMS Operators: Bulk SMS operators such
as [1, 5, 3] offer mass SMS sending over the Internet,
providing various methods ranging from HTTP to FTP
and the specialized SMPP (Short Message Peer-to-Peer)
protocol. Bulk SMS operators are so-called External Short
Message Entities (ESMEs) that are often connected via the
Internet to the mobile operators but sometimes have their
own SS7 connection to the Public Switched Telephone
Network (PSTN). Figure 4 shows the various connections
of an ESME. All Bulk SMS operators operate in
the same way: for a given amount of money they deliver
SMS messages to the specified destination(s), no
questions asked. Most of the APIs support sending a single
message to a list of recipients. Prices range from 0.1
to 0.01 Euro depending on the volume and destination
of the messages. The APIs of the bulk SMS operators
differ. Usually they allow setting a number of SMS
fields from which they assemble the actual payload. Not
all of them offer the same predefined fields. For
example, [3] was the only one that allowed us to set the
TP-Protocol-Identifier field. However, we verified that the
provided APIs are sufficient to carry out the presented
attacks and to generate attack payloads that are identical
to those sent from one of our phones.

Figure 4: SMS relevant structure of a mobile network
operator (MNO) network and the links to the PSTN, ESMEs,
and other MNOs.


Mobile Phone Botnets: A botnet consisting of hijacked
mobile phones or smartphones [35] could also be used
for such attacks, since every mobile phone is capable of
sending SMS messages. A mobile botnet has the distinct
advantage of free message delivery and high anonymity
for the attacker. Using a mobile phone botnet one could
also circumvent restrictions that Bulk SMS operators might
have in different countries.

SS7 Access: With direct access to the Signaling Sys- 
tem 7 (SS7) of the Public Switched Telephone Network 
(PSTN) an attacker can very easily send SMS messages 
in large quantities, for example to send SMS spam [25]. 
Figure 4 shows the basic network connections of a mo- 
bile network operator. SMS sending via SS7 also has the 
advantage of not being easily traceable, thus an attacker 
can stay hidden for a longer period of time. Additionally,
SMS messages sent via SS7 are not subject to the
restrictions that Bulk SMS operator APIs place on content
or header information.


5.3 Reducing the Number of Messages


One issue remains with our attack: how can
one determine the type of mobile phone that is connected
to a specific phone number? If money plays no
role in carrying out the attack, this issue is easily resolved.
The attacker just sends multiple SMS messages, each one
containing the payload for a specific type of phone, to
each phone number. One of the messages will trigger the
bug if the phone is vulnerable at all. This works well
but is not optimal. To reduce the number of messages
an attacker has to send, we developed a technique that
allows the attacker to determine what kind of phone is
connected to a specific phone number. More precisely, we can
only determine whether a specific malicious message has an
effect on the phone that is connected to a specific number.

Our method abuses a specific feature present in the
SMS standard. This feature is called recipient notification
and is indicated through the TP-Status-Report-Request
flag in an SMS message. If the flag is set, the
SMSC notifies the sender of the message when the recipient
has received the message. Most Bulk SMS operators
support this feature through their APIs. Our method
works by measuring the delay between sending the message
and receiving the reception notification.

The technique works as follows: First, we send the
message containing the payload for crash(1). Second,
when we receive the receipt for that message we send the
payload for crash(2). Third, we compare the notification
delays of the two messages. If the delays are roughly
equal, we continue with the next payload. If the difference
between them is significant, we determine that the
first message crashed the phone: the phone
needed to reboot and register on the network before being
able to accept the next message. If there is no notification,
we determine that the phone did not receive the
message because it crashed before completely accepting
the message. Fourth, we continue until all crash payloads
are sent. If none of them trigger, the phone number
is removed from the hit-list. The method can be optimized
by ordering the crash payloads according to
the popularity of mobile phones in the targeted country.
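
The decision rule in the steps above can be illustrated with a short sketch. It only classifies two already measured notification delays; the function name, the labels, and the threshold SIGNIFICANT_GAP_S are assumptions made for illustration and do not come from our measurements.

```python
from typing import Optional

# Hypothetical threshold (seconds): how much longer the second delivery
# report may take before the gap counts as "significant". Assumed value.
SIGNIFICANT_GAP_S = 30.0


def classify(delay_first: Optional[float],
             delay_second: Optional[float],
             gap: float = SIGNIFICANT_GAP_S) -> str:
    """Classify the effect of the first message from two report delays.

    Each delay is the time between sending a message and receiving its
    delivery report, or None if no report arrived.
    """
    if delay_first is None:
        # No report: the phone went down before fully accepting the message.
        return "crash_during_reception"
    if delay_second is None:
        # The follow-up message was never confirmed; nothing can be said
        # about the first message from timing alone.
        return "inconclusive"
    if delay_second - delay_first > gap:
        # The phone had to reboot and re-register before taking message two.
        return "crash_after_reception"
    return "no_effect"
```

The threshold would have to be calibrated against the normal delivery jitter of the network in question.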

With this method an attacker can optimize a hit-list
during an ongoing attack by matching bugs to phone
numbers. This optimized hit-list could also be used for
highly targeted attacks, for example against the network
operator as described in Section 5.5, which explains our
attack scenarios.


5.4 Network Assisted Attack Amplification 


Some of the bugs we discovered prevent the phone from 
acknowledging the SMS message to the network. Fig- 
ure 2 shows the states that happen during a message 
transfer from the network to the phone. In the case of
some of our bugs (Nokia S40 and Sony Ericsson; see
Section 4.10) the RP-ACK message is
not sent by the phone. This leads the network to believe
that the message was not received; therefore, the SMSC
will try to resend the SMS message to the phone. This
re-delivery attempt is a perfect attack amplifier, somewhat
similar to smurf attacks [26] on IP networks.

Figure 5: Timing of SMS message delivery attempts.


In our tests, sending malicious SMS messages over 
real operator networks, we discovered that operators 
have different re-transmit timings, shown in Figure 5. 
Furthermore, they also seem to have different transmit 
queues. We measured the delivery timings of some Ger- 
man mobile network operators in order to determine how 
one could abuse the delivery attempts for improving our 
Denial-of-Service attacks. We conducted the test by at- 
tacking one of our Sony Ericsson devices and monitoring 
the phone using the Bluetooth method described in Sec- 
tion 4.4. 

The tests were carried out on the networks of Vodafone,
T-Mobile, O2 (Telefonica), and E-Plus. The initial
delivery attempt is at minute 0. Figure 5 shows that all
operators do a first re-transmit after 1 minute, and a few more
re-transmits every 5 minutes. In addition to what Figure 5
shows, Vodafone does an additional re-delivery 24
hours after the last delivery shown in the graph. O2 also
attempts an additional re-delivery 20 hours after the last
delivery shown in the graph.

Through the same test we determined that SMS messages
are not queued, but each have an individual re-transmit
timer. That means an attacker can send multiple malicious
SMS messages to a victim’s phone with a short
delay between each message and thus increase the
effect of the network assisted attack.
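
To make the amplification concrete, the following sketch counts delivery attempts under the schedule described above: one initial attempt, one re-transmit after 1 minute, and then further re-transmits every 5 minutes. The number of 5-minute re-transmits is left as a parameter because the text only says "a few"; the function names are illustrative, and the late re-deliveries after 20 or 24 hours are not modeled.

```python
from typing import List


def attempt_times(retransmits_5min: int) -> List[int]:
    """Minutes after submission at which one message is (re-)delivered."""
    times = [0, 1]  # initial attempt, then first re-transmit after 1 minute
    for _ in range(retransmits_5min):
        times.append(times[-1] + 5)  # further re-transmits every 5 minutes
    return times


def total_attempts(messages: int, retransmits_5min: int) -> int:
    """Delivery attempts for `messages` unacknowledged messages.

    Each message has its own re-transmit timer, so the counts add up.
    """
    return messages * len(attempt_times(retransmits_5min))
```

For example, with an assumed three 5-minute re-transmits, a single unacknowledged message is attempted at minutes 0, 1, 6, 11, and 16, and four such messages cause twenty delivery attempts in total.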


5.5 Attack Scenarios and Impact 


There are multiple possible attack scenarios such as or- 
ganized crime going after the end-user, the mobile op- 
erator, and the manufacturer to demand money. Attacks 
could also be carried out for fun by script kiddies and the 
like. Below we discuss some possible scenarios. We acknowledge
that some scenarios, such as the attack against
individuals, are more likely than an attack against a
manufacturer.

Individuals: Individuals could be pressured to pay a 
few Euros in order to keep their phone operational. This
has happened with the Ikke.A [35] worm, which asked
the user to pay 5 Euros in order to regain control
of their iPhone. In our case the victim could be forced
to send a text message to a premium rate number in order 
to be taken off the hit-list. 

Another attack against an individual or a group could 
aim to prevent them from communicating. This can be 
efficiently carried out if the target uses a SIM card with 
security PIN enabled, as we describe in Section 4.9. 

Operators: Operators could be threatened with attacks
on all their customers. Such an attack would mainly
damage the operator’s reputation for reliability. The
operator might also lose money due to people being unable
to call and send text messages. In order to have a
global impact such an attack has to be carried out on a
very large scale for a longer time. As a result, customers
could possibly terminate their contract with the operator.
Such extortion scams were and still are popular on the
Internet [8].

Furthermore, the operator’s mobile network can be
attacked directly or as a side effect of a large attack
against its users. This could work when thousands of attacked
phones drop off the network and try to re-connect
at the same time. This can cause an overload of the back-end
infrastructure such as the HLR. This kind of attack
seems likely since mobile networks are not optimized for
these specific kinds of requests. A similar attack based
on unusual requests was shown in [36]. It is not normal
that thousands of phones try to connect and authenticate
at the same time over and over again. To optimize
this DoS attack, the attacker needs to make sure to target
phones connected to different BTSs and MSCs (Figure 4)
of the targeted operator in order to circumvent bottlenecks
such as the air interface at the BTS. A clogged
air interface would throttle the attack.

Manufacturers: Likewise, manufacturers could be
threatened with having their brand name destroyed or weakened
through attacks on random people owning their specific
brand of mobile phone. The attack could cost them
twice: once for the bad reputation and again for replacement
devices. Even if the phones are not broken,
victims of such an attack will still try to claim that their
device is broken to get a replacement.

Public Distress: A carefully placed attack during a
time of public distress could lead to large scale problems
and possibly a panic. One example occurred in
Estonia [19] in 2007 when a group of people carried
out a Denial-of-Service attack against the country’s Internet
infrastructure. Additionally, cutting off certain user
groups such as firemen or police officers during an emergency
situation would have a critical impact. Not every
country has special infrastructure for emergency personnel,
and some therefore rely on mobile phones to communicate.
This is true even in countries like Germany, where
every police officer carries a mobile phone since their
two-way radios are often not usable.


6 Countermeasures 


In this section we present countermeasures to detect and 
prevent the kind of attacks we developed. First, we 
present a mechanism to detect our and similar attacks 
through monitoring for a specific misbehavior. Second, 
we discuss filtering of SMS messages. Filtering can be 
done on either the phones themselves or on the network. 
We discuss the advantages and disadvantages of each of 
them. Third, we briefly discuss amplification attacks. 


6.1 Detection 


To prevent our attacks, operators first need to be able 
to detect them. Detection is not very easy since the 
operator does not get to look inside the phone during 
runtime. Therefore, the only possible way to monitor the 
phone is through the network. We propose the following: 


Monitor Phone Connectivity Status: Monitor if a 
phone disconnects from the network right after receiving 
an SMS message. 

Log last N SMS Messages: Log the last N SMS messages
sent to a particular phone in order to analyze possible
malicious messages after a crash was detected. Use
the messages as input for SMS filters/firewalls.

Use IMEI to Detect Phone Type: The brand and
type of a mobile phone can be derived from the IMEI
(International Mobile Equipment Identity). This
is useful to correlate malicious SMS messages to a
specific brand and type of phone.


Using this technique it is possible to catch malicious 
SMS messages that cause phones to reboot and lose net- 
work connectivity. This should especially help to catch 
unknown payloads that cause crashes. Such a monitor is 
also capable of detecting if a large attack is in progress by 
correlating multiple SMS-receive-disconnect events in a 
certain time-frame. 
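
A minimal sketch of such a monitor is given next, assuming the network can report SMS deliveries and disconnects per subscriber. The class name, the method names, and the 10-second window are assumptions for illustration; the only external fact used is that the first eight digits of an IMEI form the Type Allocation Code, which identifies the device model.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Dict, List, Tuple


@dataclass
class CrashMonitor:
    window_s: float = 10.0  # assumed: disconnect this soon after an SMS is suspicious
    keep_last: int = 5      # N, the number of recent messages logged per phone
    recent: Dict[str, Deque[Tuple[float, bytes]]] = field(default_factory=dict)
    suspects: List[Tuple[str, bytes]] = field(default_factory=list)

    def on_sms_delivered(self, subscriber: str, t: float, pdu: bytes) -> None:
        """Log the message; the deque keeps only the last N per subscriber."""
        log = self.recent.setdefault(subscriber, deque(maxlen=self.keep_last))
        log.append((t, pdu))

    def on_disconnect(self, subscriber: str, t: float, imei: str) -> List[bytes]:
        """Return logged PDUs delivered shortly before this disconnect."""
        tac = imei[:8]  # Type Allocation Code: identifies brand and model
        hits = [pdu for ts, pdu in self.recent.get(subscriber, ())
                if 0 <= t - ts <= self.window_s]
        # Store (phone type, PDU) pairs as candidate input for SMS filters.
        self.suspects.extend((tac, pdu) for pdu in hits)
        return hits
```

Counting entries in `suspects` per time-frame would give the correlation of multiple SMS-receive-disconnect events mentioned above.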


6.2 SMS Filtering 


SMS filtering can be implemented either directly on the 
phone or within the operator’s network. Both possibil- 
ities have inherent benefits and drawbacks that are pre- 
sented in this section. 

It is important to reconsider the process of SMS 
delivery. First, an SMS message is sent from the 
sender phone to the sender’s SMSC. Next, the sender’s
SMSC queries for the SMSC of the recipient and
delivers the message to the responsible SMSC. Fi- 
nally, the relevant SMSC locates the recipient’s phone 
and delivers the SMS message via the BTS Over-the-Air. 


Client-side SMS Filtering would need to be done 
right after the modem of the phone received and demod- 
ulated all the frames carrying the SMS message and be- 
fore pushing it up the application stack. The filter would 
need to parse the SMS message and check for known bad
messages, similar to signature-based antivirus software or
packet-filter firewalls. The problem with this solution
is the update of the signatures. Of course, the parser
in the SMS filter must be bug free otherwise the attack 
just moves from the phone software to the filter software. 
Also, devices that are already in the field would not profit 
from such a filter since only new phones will have this. 
Also, newer phones will likely not contain bugs that are 
known at the time they are manufactured. Therefore, we 
believe network-side filters make more sense. 

Network-side SMS Filtering takes place on the 
SMSC of the mobile network operator. Therefore, it can 
inspect all incoming and outgoing SMS messages. There
are multiple advantages of network-side filtering. First,
the filter software runs on the network and therefore covers
all mobile phones connected to that network. Second,
changing the filter rules can be done in one central place.
Third, malicious SMS messages are not sent out to the
destination mobile phones, thereby reducing network
load during an attack.

Network-side filters also have drawbacks. First, if a
phone is roaming within another operator’s network, the
SMS message does not travel through the network of the
home operator, so the filters are bypassed. This is
the only advantage of phone-side SMS filtering. In this
case the user becomes attackable as soon as he leaves
his home network. For traveling business people in Europe,
this is quite normal. The GSMA already has a solution
for this issue called SMS Home Routing. SMS Home
Routing as specified in [11] defines that SMS messages
are always routed through the receiver’s home network,
meaning that all SMS messages travel through the SMSCs
of his service provider at home. SMS messages can therefore
be filtered by the receiver’s service provider.
The second issue with network-side filtering is privacy.
In order to do SMS filtering the operator must be allowed
to inspect SMS messages. This could be an issue in some
countries where mobile telephony falls under special
regulations.
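
Both filter placements rely on matching a parsed message against a set of known-bad patterns. The sketch below shows one possible shape of such a check. The field names and the single example rule (a declared user data length that disagrees with the actual payload length, assuming 8-bit data coding) are hypothetical placeholders and do not describe any of the bugs we found.

```python
from typing import Callable, Dict, List

# A signature is a predicate over the parsed fields of one SMS message.
Signature = Callable[[Dict[str, object]], bool]


def udl_mismatch(fields: Dict[str, object]) -> bool:
    """Hypothetical rule: TP-User-Data-Length disagrees with the payload.

    Assumes 8-bit data coding, where the length counts octets.
    """
    return fields["udl"] != len(fields["ud"])  # type: ignore[arg-type]


# Placeholder rule set; real rules would come from analyzed crash payloads.
SIGNATURES: Dict[str, Signature] = {"udl-mismatch": udl_mismatch}


def matching_signatures(fields: Dict[str, object]) -> List[str]:
    """Names of all signatures that match; deliver only if the list is empty."""
    return [name for name, rule in SIGNATURES.items() if rule(fields)]
```

Keeping the rules in one table reflects the second advantage named above: on the network side, the table can be changed in one central place.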


6.3 Preventing Network Amplification 


Attack amplification through re-transmissions of SMS 
messages should be avoided since this greatly helps an 
attacker. We suggest that operators limit the number of
re-transmissions. Some operators re-send a message
10 times, which seems unnecessary.


7 Conclusions 


In this paper we have shown how to conduct vulnerability
analysis of feature phones. Feature phones are not
open in any way: the hardware and software are both
closed and thus do not support any classical debugging
methods. Throughout our work we have created analysis
tools based on a small GSM base station. We use the
base station to send SMS payloads to our test phones and
to monitor their behavior. Through this testing we were
able to identify vulnerabilities in mobile phones built by
six major manufacturers. The discovered vulnerabilities
can be abused for Denial-of-Service attacks. Our attacks
are significant because of the popularity of the affected
models: an attacker could potentially interrupt mobile
communication on a large scale. Our further analysis
of the mobile phone network infrastructure revealed that
networks configured in a certain way can be used to amplify
our attack. In addition, our attack can be used not
only to attack the mobile handsets; through their misbehavior
it can also be used to carry out an attack against the core
of the mobile phone network.

To detect and prevent these kinds of attacks we suggest
a set of countermeasures. We conceived a method to detect
our and similar attacks by monitoring for a specific
behavior.


¹http://en.wikipedia.org/wiki/Telephone_numbers_in_Germany

²http://en.wikipedia.org/wiki/Telephone_numbers_in_Italy

³http://en.wikipedia.org/wiki/Telephone_numbers_in_the_United_Kingdom

⁴http://en.wikipedia.org/wiki/Telephone_numbers_in_Australia


APPENDIX 


Figure 6 shows the layout of an SMS message in the
SMS-SUBMIT format. Figure 7 shows the generic layout
of a User Data Header (UDH) with a number of
Information Elements.


Field                         | Size
TP-Message-Type-Indicator     | 2 bit
TP-Reject-Duplicates          | 1 bit
TP-Validity-Period-Format     | 2 bit
TP-Status-Report-Request      | 1 bit
TP-User-Data-Header-Indicator | 1 bit
TP-Reply-Path                 | 1 bit
TP-Message-Reference          | integer
TP-Destination-Address        | 2-12 byte
TP-Protocol-Identifier        | 1 byte
TP-Data-Coding-Scheme         | 1 byte
TP-Validity-Period            | 1 byte/7 byte
TP-User-Data-Length           | integer
TP-User-Data                  | depends on DCS/UDL

Figure 6: Format of the SMS_SUBMIT PDU.
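
The six bit-sized fields at the top of the SMS-SUBMIT layout share the first octet of the PDU. As a sketch, and assuming the bit positions of 3GPP TS 23.040 as we understand them (TP-MTI in bits 0-1, TP-RD in bit 2, TP-VPF in bits 3-4, TP-SRR in bit 5, TP-UDHI in bit 6, TP-RP in bit 7), the octet can be decoded as follows. The function name is illustrative.

```python
from typing import Dict


def decode_first_octet(octet: int) -> Dict[str, int]:
    """Split the first octet of an SMS-SUBMIT PDU into its flag fields."""
    return {
        "TP-MTI": octet & 0b11,          # message type, 01 for SMS-SUBMIT
        "TP-RD": (octet >> 2) & 1,       # reject duplicates
        "TP-VPF": (octet >> 3) & 0b11,   # validity period format
        "TP-SRR": (octet >> 5) & 1,      # status report request
        "TP-UDHI": (octet >> 6) & 1,     # user data header indicator
        "TP-RP": (octet >> 7) & 1,       # reply path
    }
```

Under this layout, an octet of 0x21 is an SMS-SUBMIT with the status report request flag set, which is the flag used in Section 5.3.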







Figure 7: The User Data Header 


20th USENIX Security Symposium 


Figure 8 shows an overview of the popularity of mobile
phone manufacturers in Germany, the United States,
Europe, and around the world.


Figure 8: Mobile phone manufacturer market share.

(a) Germany, November 2009 (table not recoverable; surviving values: 72.0%, 13.0%)

(b) U.S.A., May 2010 (table not recoverable)

(c) Europe, June 2010
Manufacturer  | Market Share
Nokia         | 32.8%
Samsung       | 12.5%
LG            | 4.1%
Sony Ericsson | 3.7%
Apple         | 3.0%
RIM           | 2.4%
Others        | 3.0%

(d) World, for the year 2009
Manufacturer  | Market Share
Nokia         | 38.0%
Samsung       | 20.0%
LG            | 10.0%
Sony Ericsson | 5.0%
Motorola      | 5.0%
ZTE           | 4.5%
Kyocera       | 4.0%
RIM           | 3.5%
Sharp         | 2.6%
Apple         | 2.2%
Others        | 5.0%


Abstract


Prior work has shown that return oriented programming 
(ROP) can be used to bypass W⊕X, a software defense
that stops shellcode, by reusing instructions from large 
libraries such as libc. Modern operating systems have 
since enabled address randomization (ASLR), which ran- 
domizes the location of libc, making these techniques 
unusable in practice. However, modern ASLR implemen- 
tations leave smaller amounts of executable code unran- 
domized and it has been unclear whether an attacker can 
use these small code fragments to construct payloads in 
the general case. 

In this paper, we show defenses as currently deployed 
can be bypassed with new techniques for automatically 
creating ROP payloads from small amounts of unran- 
domized code. We propose using semantic program ver- 
ification techniques for identifying the functionality of 
gadgets, and design a ROP compiler that is resistant to 
missing gadget types. To demonstrate our techniques, we 
build Q, an end-to-end system that automatically gener- 
ates ROP payloads for a given binary. Q can produce 
payloads for 80% of Linux /usr/bin programs larger 
than 20KB. We also show that Q can automatically per- 
form exploit hardening: given an exploit that crashes 
with defenses on, Q outputs an exploit that bypasses both 
W⊕X and ASLR. We show that Q can harden nine
real-world Linux and Windows exploits, enabling an attacker
to automatically bypass defenses as deployed by industry 
for those programs. 


1 Introduction 


Control flow hijack vulnerabilities are extremely danger- 
ous. In essence, they allow the attacker to hijack the 
intended control flow of a program and instead execute 
whatever actions the attacker chooses. These actions
could be to spawn a remote shell to control the program,
to install malware, or to exfiltrate sensitive information 
stored by the program. 


Luckily, modern OSes now employ W⊕X and ASLR
together, two defenses intended to thwart control flow
hijacks. Write xor eXecute (W⊕X, also known as DEP)
prevents an attacker’s payload itself from being directly 
executed. Address space layout randomization (ASLR) 
prevents an attacker from utilizing structures within the 
application itself as a payload by randomizing the ad- 
dresses of program segments. These two defenses, when 
used together, make control flow hijack vulnerabilities 
difficult to exploit. 


However, ASLR and W⊕X are not enforced completely
on modern OSes such as OS X, Linux, and Windows.
By completely, we mean enforced such that no
portion of code is unrandomized for ASLR, and that injected
code can never be executed by W⊕X. For example,
Linux does not randomize the program image, OS X does
not randomize the stack or heap, and Windows requires
third party applications to explicitly opt-in to ASLR and
W⊕X. Enforcing ASLR and W⊕X completely does not
come without cost; it may break some applications, and
introduce a performance penalty.





Previous work [41] has shown that systems that do 
not randomize large libraries like libc are vulnerable to 
return oriented programming (ROP) attacks. At a high 
level, ROP reuses instruction sequences already present 
in memory that end with ret instructions, called gad- 
gets. Shacham showed that it was possible to build a 
Turing-complete set of gadgets using the program code 
of libc. Finding ROP gadgets has since been, to a large 
extent, automated when large amounts of code are left un- 
randomized [16, 21, 38]. However, it has been left as an 
open question whether current defenses, which randomize 
large libraries like libc but leave small amounts of code
unrandomized, are sufficient for all practical purposes, or
permit such attacks. 

In this paper, we show that current implementations are 
vulnerable by developing automated ROP techniques that 
bypass current defenses and work even when there is only 
a small amount of unrandomized code. While it has long 
been known that ASLR and W⊕X offer important protection
in theory, our main message is that current practical
implementations make compatibility and performance 
tradeoffs, and as a result it is possible to automatically 
harden existing exploits to bypass these defenses. 

Bypassing defenses on modern operating systems re- 
quires ROP techniques that work with whatever unran- 
domized code is available, and not just pre-determined 
code or large libraries. To this end, we introduce several 
new ideas to scale ROP to small code bases. 

One key idea is to use semantic definitions to determine
the function, if any, of an instruction sequence. For
instance, rather than defining movl *, *; ret as a
move gadget [21, 38], we use the semantic definition
OutReg ← InReg. This allows us to find unexpected
gadgets, such as realizing that imul $1, %eax, %ebx;
ret¹ is actually a move gadget.
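
The idea of classifying by semantics rather than by syntax can be illustrated with a small sketch. Note that it substitutes randomized testing for the program verification techniques we actually propose, so it can only refute a candidate, never prove it. The register model and all function names are assumptions for illustration.

```python
import random
from typing import Callable, Dict

Regs = Dict[str, int]
MASK32 = 0xFFFFFFFF


def imul_1_eax_ebx(regs: Regs) -> Regs:
    """Model of `imul $1, %eax, %ebx` (AT&T syntax): ebx <- 1 * eax mod 2^32."""
    out = dict(regs)
    out["ebx"] = (1 * regs["eax"]) & MASK32
    return out


def imul_2_eax_ebx(regs: Regs) -> Regs:
    """Model of `imul $2, %eax, %ebx`: same syntax shape, not a move."""
    out = dict(regs)
    out["ebx"] = (2 * regs["eax"]) & MASK32
    return out


def looks_like_move(gadget: Callable[[Regs], Regs], in_reg: str, out_reg: str,
                    trials: int = 1000, seed: int = 0) -> bool:
    """Check OutReg <- InReg on random register states (refutation only)."""
    rng = random.Random(seed)
    for _ in range(trials):
        regs = {r: rng.getrandbits(32) for r in ("eax", "ebx", "ecx", "edx")}
        if gadget(regs)[out_reg] != regs[in_reg]:
            return False
    return True
```

A syntactic matcher for `movl` would miss the first sequence entirely, while the semantic check accepts it and rejects the multiply-by-two variant.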

Another key point is that our system needs to grace- 
fully handle missing gadget types. This is comparable 
to writing a compiler for an instruction set architecture, 
except with some key instructions removed; the com- 
piler must still be able to add two numbers even when 
the add instruction is missing. We use an algorithm 
that searches over many combinations of gadget types in 
such a way that will synthesize a working payload even 
when the most natural gadget type is unavailable. Prior 
work [16, 21, 38] focuses on finding gadgets for all gad- 
get types, such that a compiler can then create a program 
using these gadget types. This direct approach will not 
work without additional logic if some gadget types are 
missing. However, we are not aware of prior work that 
considers this. This is essential in our application domain, 
since most programs will be missing some gadget types. 
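
The compiler analogy can be made concrete with a toy sketch. It is not the algorithm used by Q; it only shows the flavor of falling back to an alternative combination when the natural gadget type is missing. The rewrite table, the gadget type names, and the functions are hypothetical.

```python
from typing import Dict, List, Optional, Set

MASK32 = 0xFFFFFFFF

# Hypothetical rewrite table: each operation lists gadget-type combinations
# that can realize it, most natural first. a + b equals a - (-b).
ALTERNATIVES: Dict[str, List[List[str]]] = {
    "add": [["add"], ["neg", "sub"]],
}


def plan(op: str, available: Set[str]) -> Optional[List[str]]:
    """Return the first combination whose gadget types are all available."""
    for combo in ALTERNATIVES.get(op, []):
        if all(g in available for g in combo):
            return combo
    return None


def add_without_add(a: int, b: int) -> int:
    """32-bit addition built only from negate and subtract."""
    neg_b = (-b) & MASK32            # neg gadget
    return (a - neg_b) & MASK32      # sub gadget
```

With only negate and subtract gadgets available, `plan("add", {"neg", "sub"})` selects the two-gadget fallback, and `add_without_add` confirms that the fallback computes the same 32-bit result as a direct add.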

Our results build on existing ROP research. Previous
ROP research was either performed by hand [6, 9, 41], or
focused on large code bases such as libc [38] (1,300KB),
a kernel [21] (5,910KB) or mobile libraries [16, 24] (size
varies; on the order of 1,000KB). In contrast, our techniques
work on small amounts of code (20KB). In our evaluation
(Section 7), we show that Q can build ROP payloads for
80% of Linux programs larger than 20KB. Q can also
transplant the ROP payloads into an existing exploit that
does not bypass defenses, effectively hardening the original
exploit to bypass W⊕X and ASLR. Recent work in
automatic exploit generation [2, 5] can be used to generate
such exploits. We show that Q can automatically
harden nine exploits for real binary programs on Linux
and Windows to bypass implemented defenses. Since
these defenses can automatically be bypassed, we conclude
that they provide insufficient security.

¹We use AT&T assembly syntax in this paper, i.e., the source operand
comes first.


Contributions. Our main contribution is demonstrating 
that existing ASLR and W⊕X implementations do not
provide adequate protection by developing automated
techniques to bypass them. First, we perform a survey
of modern implementations and show that they often do
not protect all code even when they are “turned on”. This
motivates our problem setting. Second, we develop ROP 
techniques for small, unrandomized code bases as found 
in most practical exploit settings. Our ROP techniques 
can automatically compile programs written in a high- 
level language down to ROP payloads. Third, we evaluate 
our techniques in an end-to-end system, and show that 
we can automatically bypass existing defenses for nine 
real-life vulnerabilities on both Windows and Linux. 


2 Background and Defense Survey 


There is a notion that code reuse attacks like return ori- 
ented programming are not possible when ASLR is en- 
abled at the system level. This is only half true. If ASLR 
is applied to all program segments, then code reuse is in- 
tuitively difficult, since the attacker does not know where 
any particular instruction sequence will be in memory. 
However, ASLR is not currently applied to all program 
segments, and we will show that attackers can use this 
to their advantage. In this section, we explain the W⊕X
and ASLR defenses in more detail, focusing on when a 
program segment may be left unprotected. 

Table 1 summarizes some of these limitations. The key
insight that we make use of in this paper is that program
images are always unrandomized unless the program explicitly
opts in to randomization. On Linux, for instance,
this means that developers must set non-default compiler
flags to enable randomization. Another surprise is that
W⊕X is often disabled when older hardware is used;
some virtualization platforms by default will omit the
virtual hardware needed to enable W⊕X.


2.1 W⊕X


                  |     | ASLR        | ASLR      | ASLR
Operating System  | W⊕X | stack, heap | libraries | program image
Ubuntu 10.04      | Yes | Yes         | Yes       | Opt-In
Debian Sarge      | HW  | Yes         | Yes       | Opt-In
Windows Vista, 7  | HW  | Yes         | Opt-In    | Opt-In
Mac OS X 10.6     | HW  | No          | Yes       | No

Table 1: Comparison of defenses on modern operating
systems for the x86 architecture with default settings. Opt-In
means that programs and libraries must be explicitly
marked by the developer at compile time for the protection
to be enabled, and that some compilers do not enable
the marking by default. HW denotes that the level of
protection depends on hardware.


W⊕X prevents attackers from injecting their own payload
and executing it by ensuring that protected program segments
are not writable and executable at the same time
(Writable ⊕ eXecutable²). Attackers have traditionally
included shellcode (executable machine code) in their
exploits as payloads. Since shellcode must be written to
memory at runtime, it cannot be executed because of the
W⊕X property.


W⊕X Implementation W⊕X is implemented [29, 30,
35] using an NX (no execute) bit that the hardware platform
enforces: if execution moves to a page with the NX bit
enabled, the hardware raises a fault. On x86, this bit can
be set using the PAE addressing mode [22].

PAE support is disabled by default in Ubuntu Linux,
since some older hardware does not support it. The
ExecShield [31] patch, which is included in Ubuntu, can
emulate W⊕X by using x86 segments, even when hardware
NX support is not available. Other distributions (such
as Debian) do not include the ExecShield patch, and do
not provide any W⊕X protection in default kernels.

Windows 7 enables W⊕X³ by default for processors
supporting the NX bit. However, it only enforces W⊕X
for binaries and libraries marked as W⊕X compatible.
Many notable third-party software programs such as
Oracle’s Java JRE, Apple Quicktime, VLC Media Player and
others do not opt-in to W⊕X [36].


Limitations The main limitation of W⊕X is that it only
prevents an attacker from utilizing new payload code. The
attacker can still reuse existing code in memory. For
instance, an attacker can call system by launching a
return-to-libc attack, in which the attacker creates an
exploit that will call a function in libc without injecting any
shellcode. W⊕X does not prevent return-to-libc attacks
because the executed code is in libc and is intended to
be executable at compile time. Return Oriented
Programming is another, more advanced attack on W⊕X, which
we discuss in Section 2.3.


²W⊕X is actually a misnomer, because memory is allowed to be
unwritable and non-executable, but 0 ⊕ 0 = 0.

³W⊕X is called DEP by the Windows community. Windows also
contains software DEP, but this is unrelated to W⊕X [30].


2.2 ASLR 


ASLR prevents an attacker from directly referring to ob- 
jects in memory by randomizing their locations. This 
stops an attacker from being able to transfer control to his 
shellcode by hardcoding its address in his exploit. Like- 
wise, it makes return-to-libc and ROP using libc difficult, 
because the attacker will not know where libc is located 
in memory. 


Implementation ASLR implementations randomize 
some subset of the stack, heap, shared libraries (e.g., libc), 
and program image (e.g., the .text section).

Linux [31, 34] randomizes the stack, heap, and shared 
libraries, but not the program image. Programs can be 
manually compiled into position independent executables 
(PIEs) which can then be loaded to multiple positions 
in memory. Modern distributions [14, 44] only compile 
a select group of programs as PIEs, because doing so 
introduces a performance overhead at runtime. 

Windows Vista and 7 [29, 43] can randomize the loca- 
tions of the program image, stack, heap, and libraries, but 
only when the program and all of its libraries opt-in to 
ASLR. If they do not, some code is left unrandomized. 
Many third-party applications including Oracle’s Java 
JRE, Adobe Reader, Mozilla Firefox, and Apple Quick- 
time (or one of their libraries) are not marked as ASLR 
compatible [36]. Ultimately, this means most Windows 
binaries have unrandomized code. 


Limitations Some attacks on ASLR implementations 
take advantage of the low entropy available for random- 
ization. For instance, Shacham, et al. [42] show that 
brute forcing ASLR on a 32-bit platform takes about 200 
seconds on average. (We do not consider attacks that 
take more than one attempt in this paper; we create ex- 
ploits that succeed on the first try.) Other attacks, such as 
ret2reg attacks, allow the attacker to transfer control 
to their payload by utilizing pointers leaked in registers 
or memory [32]. For instance, the strcpy function
returns such a pointer to the destination string in the %eax
register. The applicability of these attacks is heavily
dependent on the vulnerable program.
Consumed by instruction    32-bit value
ret                        nextAddr
ret                        addr3
pop %ebp                   memAddr
ret                        addr2
pop %eax                   memValue
                           addr1


Figure 1: Example payload for storing memValue to
memAddr for the scenario described in the text. This
payload will transfer control to address nextAddr after
writing to memory.


2.3 Return Oriented Programming


Return Oriented Programming is a generalization of the
return-to-libc attack. In a return-to-libc attack the attacker
reuses entire functions from libc. With ROP, the attacker
uses instruction sequences found in memory, called
gadgets, and chains them together. ROP attacks are
desirable because they allow the attacker to perform
computations beyond the functions of libc (or whatever code
is unrandomized). This is especially important in the
context of modern systems, because the unrandomized
code may not contain useful functions for the attacker.
Researchers [16, 21, 41] have shown that it is possible to
find gadgets for performing Turing-complete operations
in libc, the Windows kernel, and mobile phone libraries.


Example 2.1 (Return Oriented Programming). Assume
that the following instruction sequences are in memory
at addr1: pop %eax; ret; at addr2: pop %ebp;
ret; and at addr3: movl %eax, (%ebp); ret.
The first two sequences pop a 32-bit value from the stack,
store it into a register, and then jump to the address stored
on the stack. If the attacker controls the stack and can
cause one of these instruction sequences to execute, then
the attacker can put values in %eax and %ebp and transfer
control to another address. By chaining together all three
instruction sequences, the attacker can write to memory
(and still transfer control to the next gadget). The
attacker’s payload for writing memValue to memAddr is
shown in Figure 1. It is possible to execute arbitrary
programs by stringing together gadgets of different types.
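
The stack layout of Figure 1 can be sketched in a few lines of Python. This is only an illustration: the three gadget addresses and the example values are hypothetical placeholders, and the sketch assumes a 32-bit little-endian x86 target.

```python
import struct

# Hypothetical gadget addresses (assumptions for illustration only); in
# practice they would point into unrandomized code of the program.
ADDR1 = 0x08048101  # pop %eax; ret
ADDR2 = 0x08048202  # pop %ebp; ret
ADDR3 = 0x08048303  # movl %eax, (%ebp); ret


def store_payload(mem_value, mem_addr, next_addr):
    """Pack the six 32-bit words of Figure 1, lowest stack address first."""
    words = [ADDR1, mem_value, ADDR2, mem_addr, ADDR3, next_addr]
    return struct.pack("<6I", *words)


# Example values are arbitrary placeholders.
payload = store_payload(0x41414141, 0x0804A000, 0x08048404)
```

Each word is consumed in order: the hijacked return consumes addr1, pop %eax consumes memValue, and so on until the final ret transfers control to nextAddr.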
3 System Overview


In the next two sections, we describe Q⁴, our system for
automatic exploit hardening. Figure 2 shows the end-to-end
workflow of Q, which is divided into two phases. The
first phase automatically generates ROP payloads
(Section 4). The second phase is exploit hardening (Section 5).
In exploit hardening, Q takes the ROP payloads generated
in the first stage and transplants them into existing
exploits which do not bypass defenses. The resulting
exploit can then bypass W⊕X and ASLR.


4 Automatically Generating Return-Oriented Payloads


Q’s end-to-end return oriented programming system con- 
sists of a number of different stages. Previous research 
on automated ROP has typically focused on one specific 
stage; for instance, gadget discovery [16, 24, 38] or com- 
pilation [6]. Since Q is an end-to-end ROP system, it has 
multiple stages. We describe each stage in the context of 
a user’s potential interaction with the system below. 


4.1 Example Usage Scenario 


Assume that Alice wants to create a ROP payload that 
calls system (her target program) using instructions 
from rsync’s unrandomized code (her source program). 
Here, source program means the program from which Q 
takes instruction sequences to construct gadgets (e.g., the 
program with a vulnerability), and target program means 
the program Alice wants to run (using ROP). Alice would 
use the following stages of Q, which are depicted in the 
top half of Figure 2: 


Gadget Discovery The first stage of Q is to find gad- 
gets in the source program that Alice provides — in this 
case, rsync. The gadgets will be the building blocks for 
the ROP payloads that are ultimately created, and thus it is 
important to find as many as possible. Q finds gadgets of 
various types (specified in Table 2) by using semantic pro- 
gram verification techniques on the instruction sequences 
found in rsync. 

Q’s semantic engine allows it to find gadgets that
humans might miss. For instance, Q can automatically
determine that lea (%ebx,%ecx,1),%eax; ret
adds %ebx and %ecx and stores the result in %eax.
Likewise, it discovers that sbb %eax, %eax; neg
%eax; ret moves the carry flag (CF) to %eax.
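
To see why the sbb/neg sequence moves the carry flag into %eax, the two instructions can be modelled on 32-bit values. The helper functions below are a simplified model written for this illustration (they track only the result and the carry flag) and are not part of Q.

```python
MASK = 0xFFFFFFFF  # 32-bit register width


def sbb(dst, src, cf):
    """Model of 32-bit sbb: dst - src - CF. Returns (result, new CF)."""
    full = dst - src - cf
    return full & MASK, 1 if full < 0 else 0


def neg(val):
    """Model of 32-bit neg (two's complement). Returns (result, new CF)."""
    return (-val) & MASK, 1 if val != 0 else 0


def sbb_neg_gadget(eax, cf):
    """Model of `sbb %eax, %eax; neg %eax; ret` (stack effects omitted)."""
    eax, cf = sbb(eax, eax, cf)  # eax becomes 0 or 0xFFFFFFFF, i.e. -CF
    eax, cf = neg(eax)           # negating -CF yields CF
    return eax
```

Whatever %eax held initially, sbb %eax, %eax leaves -CF in the register, and neg turns that into CF itself.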


⁴We name our system after Q from the James Bond movies, who
creates, modifies, and combines gadgets to help Bond meet his objectives.

Figure 2: An overview of Q’s design.


Input Alice writes the target program that she wants to 
execute in Q’s high level language, QooL (shown in Table 
3). The target program calls system with the desired 
arguments (e.g., /bin/sh). 


Gadget Arrangement Q builds a list of gadget ar- 
rangements. Each gadget arrangement is a way of im- 
plementing the target program using different types of 
gadgets. For example, we show a gadget arrangement 
for writing to memory in Figure 3; this arrangement is 
the most natural way of storing to memory, but will not 
work if Q cannot find a STOREMEMG gadget. Gadget
arrangement is somewhat analogous to instruction selec- 
tion in a compiler. A major difference is that a regular 
compiler can use whichever instructions it chooses, but 
Q is limited to the gadget types that were found during 
gadget discovery. 

Gadget arrangement allows Q to cope with missing 
gadgets. If the most natural choice of gadget is not avail- 
able, Q effectively tries to synthesize a combination of 
other gadgets that will have the same semantics. We are 
not aware of other ROP compilers that consider this. 


Gadget Assignment Gadget assignment takes gadgets 
found during discovery, and assigns them in the arrange- 
ments that Q generated. The difficulty is that assignments 
must be compatible. This means that the output register of 
one gadget must match the input register on the receiving 
gadget. Likewise, gadgets cannot clobber a register if that 
value is waiting to be used by a future gadget. This phase 
is roughly analogous to register allocation in a traditional 
compiler. Unlike a traditional compiler, Q cannot spill 
registers to memory, since this usually increases register 
pressure instead of decreasing it. As an example, Q as- 
signs the following gadgets from rsync to implement 
the gadget arrangement in Figure 3: 


; Load value into %eax
pop %ebp; ret; xchg %eax, %ebp; ret
; Load address-0x14 into %ebx
pop %ebx; pop %ebp; ret
; Store memory
mov %eax, 0x14(%ebx); ret


Output Finally, as long as at least one of the gadget 
arrangements has been assigned compatible gadgets, Q 
prints out payload bytes that Alice can use in her exploit. 
If Alice already has an exploit that no longer works
because of W⊕X and ASLR, she can feed in the generated
ROP payload along with her old exploit to the second 
phase of Q (see Section 5) to harden her exploit against 
these defenses. 

We now explain each stage of Q in more detail. 


4.2 Gadget Discovery 


Not every instruction sequence can be used as a gadget. 

Q requires each gadget to satisfy four properties: 

Functional Each gadget has a type (from Table 2) that 
defines its function. In our system, a gadget’s type 
is specified semantically by a boolean predicate that 
must always be true after executing the gadget. 

Control Preserving Each gadget must be capable of 
transferring control to another gadget. In our system, 
this means that the gadget must end with ret or 
some semantically equivalent instruction sequence 
(e.g., pop %eax; jmp *%eax).

Known Side-effects The gadget must not have unknown 
side-effects. For instance, the gadget must not write 
to any undesired memory locations. 

Constant Stack Offset Most gadget types require the 
stack pointer to increase by a constant offset after 
each execution. 


Although we found these requirements to work well,
we discuss alternatives to the control preservation and
known side-effects requirements in Section 8.


4.2.1 Gadget Types 


The set of gadget types in Q defines a new instruction
set architecture (ISA) in which each gadget type
functions as an instruction. At a high level, we specify the
meaning of each gadget type with a postcondition B that
must be true after executing it. Prior work has used
different mechanisms for specifying gadget types, including
pattern matching on assembly instructions [21, 38] and
expression tree matching [16]. We found postconditions
to be more natural than these mechanisms. An instruction
sequence Z satisfies a postcondition B if and only if the
postcondition is true after running Z from any starting
state. The starting state consists of assignments to
registers and memory. The full list of gadget types that Q
can recognize is in Table 2, along with the corresponding
semantic definition postconditions.


4.2.2 Semantic Analysis 


Given an instruction sequence Z and a semantic definition 
B, Q must decide if Z will satisfy B. For this, we use a 
well-known technique from program verification for com- 
puting the weakest precondition of a program [15, 17, 23]. 
At a high level, the weakest precondition WP(Z, B) for 
instructions Z and postcondition B is a boolean precon- 
dition that describes when Z will terminate in a state 
satisfying B. 

We use weakest preconditions in Q to verify whether 
the semantic definition of a gadget always holds after 
executing the instruction sequence Z. To do this, we 
check if 


WP(Z, B) = true. (1) 


If this formula is valid, then B always holds after
executing Z, and we can conclude that Z is a gadget with the
semantic type B.

Our first prototype used only this semantic analysis.
We found that it was too slow to be practical. We sped
up the entire process by performing a number of random
concrete executions, and evaluating each B concretely to
see if it was true. If B was false for any concrete input,
then the instruction sequence could not be a gadget for
that gadget type. Thus, we only need to invoke the more
expensive weakest precondition process when B is true
for every random concrete execution.

Random concrete execution can also be used to infer
possible parameter values (shown in Table 2) using
dynamic analysis. For instance, by looking at the values
of all registers, and the addresses that memory was read
from, Q can compute a set of possible offsets for the
LOADMEMG gadget type.


As an example of how a gadget type is tested,
consider the LOADMEMG gadget type in Table 2.
LOADMEMG gadgets operate on two registers: the output
register and the address register. Each LOADMEMG gadget
has two parameters that are specific to a particular
instruction sequence Z. These will be found using dynamic
analysis as described above. For instance, the
instruction sequence movl 0xc(%eax), %ebx; ret is a
LOADMEMG gadget with parameters {# Bytes ← 4} and
{Offset ← 12} and registers {OutReg ← %ebx} and
{AddrReg ← %eax}. The semantics for this instruction
sequence would be %ebx ← M[%eax + 12]. Q converts
this to final(%ebx) = initial(M[%eax + 12]),
which is the postcondition B that is checked for validity.
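
The random concrete testing step for this LOADMEMG example can be sketched as follows. Everything here is a simplified stand-in rather than Q's implementation: instruction sequences are modelled as hypothetical Python functions over a register dictionary, memory is a pseudo-random function of the address, and stack effects are omitted.

```python
import random

MASK = 0xFFFFFFFF


def random_state(rng):
    """Random initial registers plus a pseudo-random 32-bit memory function."""
    regs = {r: rng.getrandbits(32) for r in ("eax", "ebx", "ecx", "edx")}
    seed = rng.getrandbits(64)

    def mem(addr):
        # Same address always yields the same contents within one state.
        return random.Random(seed ^ (addr & MASK)).getrandbits(32)

    return regs, mem


def movl_0xc_eax_ebx(regs, mem):
    """Model of `movl 0xc(%eax), %ebx; ret`."""
    out = dict(regs)
    out["ebx"] = mem((regs["eax"] + 0xC) & MASK)
    return out


def movl_eax_ebx(regs, mem):
    """Model of `movl %eax, %ebx; ret` (a MOVEREGG, not a LOADMEMG)."""
    out = dict(regs)
    out["ebx"] = regs["eax"]
    return out


def loadmem_post(out_reg, addr_reg, offset):
    """Postcondition B: final(out_reg) = initial(M[addr_reg + offset])."""
    def post(initial, mem, final):
        return final[out_reg] == mem((initial[addr_reg] + offset) & MASK)
    return post


def survives_random_testing(seq, post, num_runs=32, seed=0):
    """True if B held on every random run; only then is the WP check needed."""
    rng = random.Random(seed)
    for _ in range(num_runs):
        regs, mem = random_state(rng)
        if not post(regs, mem, seq(regs, mem)):
            return False
    return True
```

Passing this filter is necessary but not sufficient: a sequence that survives is only a candidate, and the validity check on the weakest precondition is what establishes that B always holds.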


4.2.3 Gadget Discovery Algorithm 


Our techniques for gadget discovery consist of two algo- 
rithms. The first, shown in Algorithm 1, tests whether 
or not the semantics of an instruction sequence Z match 
those of any gadget type using randomized concrete testing
and validity checking of the weakest precondition.
Algorithm 1 also outputs some metadata (not shown) about
each gadget for use in other Q algorithms, including the 
gadget’s address, stack offset, and any registers that the 
gadget clobbers. The second algorithm iterates over the 
executable bytes of the source program, disassembles 
them, and calls the first algorithm as a subroutine. This 
is similar to the Galileo [41] algorithm, and so we do not 
replicate it here. 

Algorithm 1 Automatically test an instruction sequence
Z for gadgets

Input: Z, numRuns, gadgetTypes[]
  for i = 1 to numRuns do
    outState[i] ← Z(Random input)
  end for
  for gtype ∈ gadgetTypes do
    B ← postconditions[gtype]
    consistent ← true
    for j = 1 to numRuns do
      if B(outState[j]) = false then
        consistent ← false
      end if
    end for
    if consistent = true then  {Possibly a gadget of type gtype}
      F ← WP(Z, B)
      if decisionProc(F = true) = Valid then
        output  {Output gadget Z as type gtype}
      end if
    end if
  end for
Name               Input                     Parameters            Semantic Definition
NOOPG              -                         -                     Does not change memory or registers
JUMPG              AddrReg                   Offset                EIP ← AddrReg + Offset
MOVEREGG           InReg, OutReg             -                     OutReg ← InReg
LOADCONSTG         OutReg, Value             -                     OutReg ← Value
ARITHMETICG        InReg1, InReg2, OutReg    ◇b                    OutReg ← InReg1 ◇b InReg2
LOADMEMG           AddrReg, OutReg           # Bytes, Offset       OutReg ← M[AddrReg + Offset]
STOREMEMG          AddrReg, InReg            # Bytes, Offset       M[AddrReg + Offset] ← InReg
ARITHMETICLOADG    OutReg, AddrReg           # Bytes, Offset, ◇b   OutReg ◇b← M[AddrReg + Offset]
ARITHMETICSTOREG   InReg, AddrReg            # Bytes, Offset, ◇b   M[AddrReg + Offset] ◇b← InReg

Table 2: Types of gadgets that Q can find. M[addr] means accessing memory at address addr. ◇b means an arbitrary
binary operation. a ← b denotes that the final value of a equals the initial value of b. X ◇b← Y is short for X ← X ◇b Y.


4.3 Gadget Arrangement 


Q acts like a compiler: it reads in programs written
in QooL (discussed below) and tries to implement them in 
terms of the gadgets shown in Table 2. The gadgets define 
an instruction set architecture. Thus, we can use some 
techniques from compiler theory. However, Q must deal 
with several hard problems not faced by most compilers: 


• Only a few registers can be used for moving,
  accessing memory, and performing arithmetic operations.

• Most instructions will clobber (modify) the majority
  of available registers.

• Some instruction types may not be available at all.


Although we use existing compiler techniques when 
possible, many of the standard compiler techniques break 
down. 


4.3.1 Q’s Language: QooL 


Users write the target program in Q’s high level language, 
QooL, which is displayed in Table 3. QooL enables 
the user to easily interact with the exploited program’s 
environment. For instance, the attacker can do this by 
calling a function (e.g., system), overwriting values in 
memory, or copying and running a binary payload (when
W⊕X is not present or has been disabled by first calling
mprotect or a similar function). QooL is not Turing-complete;
we discuss this further in Section 8.


4.3.2 Arrangements 


One of the essential tasks of a compiler is to perform
instruction selection, since there are many combinations
of instructions that can implement a given computation.
The gadget architecture is no exception, as there are many
ways of combining gadget types to produce a particular
computation. We specify each combination of gadgets
using a gadget arrangement.

<exp> ::=
    LoadMem <exp> <type>
  | BinOp <binop_type> <exp> <exp>
  | Const <int> <type>
<stmt> ::=
    StoreMem <exp> <exp> <type>
  | Assign <var> <exp>
  | CallExternal <func> <exp list>
  | Syscall

Table 3: Grammar for our high level language, QooL.

A gadget arrangement is a tree in which the vertices
represent gadget types⁵, and an edge labeled type from a
to b means that the output of gadget a is used for the type 
input in gadget b. An example arrangement is shown in 
Figure 3. 

One simple algorithm for performing instruction selec- 
tion (or selecting a gadget arrangement, in our case) is 
the maximal munch algorithm [1]. Maximal munch as- 
sumes that any instruction selected as the best will always 
be available for use. This assumption makes sense in a 
traditional compiler, since on a normal architecture there 
are few restrictions on when instructions can be used. 

A gadget arrangement algorithm cannot make such 
assumptions. Any particular gadget type chosen by max- 
imal munch might not be available at that point in the 
program because Q did not find any or the registers in the 
gadgets are not compatible with other gadgets needed. 

Instead of using maximal munch, Q employs every
munch. Rather than selecting only one arrangement of
gadget types as maximal munch would, every munch
lazily builds a tree representing all possible ways that
gadget types can be arranged to perform a computation.
This is done by recursively applying munch rules to the
program being compiled.


⁵Vertices also include parameters that are relevant to the computation,
such as binary operator type and number of bytes for memory operations.

[Figure 3 graphic not recoverable from extraction; surviving node labels: LoadReg; StoreMem, u32]

Figure 3: A gadget arrangement for storing a constant
value to a constant address. A possible schedule for the
arrangement is denoted by the time slots T_i.


4.3.3 Munch Rules


Each QooL language construct has at least one munch 
rule that can implement the construct in terms of the 
implementations of its subexpressions. For instance, the 
obvious munch rule for the StoreMem statement is to 
use a STOREMEMG gadget, which we show below in 
ML-style pseudo code. 





munch = function
| StoreMem(e1, e2, t) ->
  let e1l = munch e1 in
  let e2l = munch e2 in
  (* For each e1g, e2g in Cartesian
     product of e1l and e2l do: *)
  add_output (StoreMemG(addr=e1g,
                        value=e2g, typ=t));











Our initial implementation only contained these obvious 
rules. We quickly found that it could not find payloads 
for most binaries. 
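
The idea of every munch, returning all arrangements instead of a single best one, can be sketched in Python. The tuple encoding of QooL terms and arrangements, the "Seq" node, and the second StoreMem rule (clear the location with AND 0, then add the value) are assumptions made for this illustration; the sketch also builds the list eagerly, whereas Q builds its tree lazily.

```python
from itertools import product


def munch(term):
    """Return every arrangement (as nested tuples) implementing `term`."""
    kind = term[0]
    if kind == "Const":
        return [("LoadConstG", term[1])]
    if kind == "StoreMem":
        _, addr, value = term
        out = []
        # Cartesian product of the arrangements of both subexpressions.
        for a, v in product(munch(addr), munch(value)):
            # Obvious rule: a single STOREMEMG gadget.
            out.append(("StoreMemG", a, v))
            # Alternative rule: AND the location with 0, then ADD the value.
            clear = ("ArithmeticStoreG", "&", a, ("LoadConstG", 0))
            add = ("ArithmeticStoreG", "+", a, v)
            out.append(("Seq", clear, add))
        return out
    raise ValueError("no munch rule for %r" % (kind,))


arrangements = munch(("StoreMem", ("Const", 0x0804A000), ("Const", 7)))
```

Because every alternative is kept, a later stage can discard arrangements whose gadget types are missing and still fall back on the others.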

We found that, in practice, many binaries do not contain
gadgets for directly storing to memory (STOREMEMG
in Table 2). We provide evidence of this in Section 7.1.
However, if Q can learn or set the value in memory to 0
or -1, it can use an ARITHMETICSTOREG gadget with
mathematical identities to write an arbitrary value. As
one example, Q can write zero to memory by bitwise
ANDing the memory location with zero, and then adding
the desired number. The example below shows the
complicated return oriented program Q discovered for writing
a single byte to memory with bitwise or, using gadgets
from apt-get. More straightforward options were not
available.


; Load eax: -1
pop %ebp; ret; xchg %eax, %ebp; ret
; Load ebx: address-0x5e5b3cc4
pop %ebx; pop %ebp; ret
; Write -1
or %al, 0x5e5b3cc4(%ebx); pop %edi;
pop %ebp; ret
; Load eax: value + 1
pop %ebp; ret; xchg %eax, %ebp; ret
; Load ebp: address-0xf3774ff
pop %ebp; ret
; Add value + 1
add %al, 0xf3774ff(%ebp);
movl $0x85, %dh; ret
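
The identity behind this payload can be checked directly: OR-ing a byte with the low byte of -1 forces it to 0xff, and adding value + 1 then wraps around to value. The Python below is only a model of the two byte-wide memory operations, not of the gadgets themselves.

```python
def or_byte(mem, al):
    """Model of a byte-wide `or %al, mem`."""
    return (mem | al) & 0xFF


def add_byte(mem, al):
    """Model of a byte-wide `add %al, mem` (wraps modulo 256)."""
    return (mem + al) & 0xFF


def write_byte(old, value):
    """Write `value` over an unknown byte `old` using only OR and ADD."""
    mem = or_byte(old, 0xFF)                 # low byte of -1: mem is now 0xff
    mem = add_byte(mem, (value + 1) & 0xFF)  # 0xff + value + 1 = value mod 256
    return mem
```

The initial contents of the byte do not matter, which is why Q does not need to know the old value at the target address.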





4.4 Gadget Assignment 


Q must determine if a gadget arrangement can be satisfied 
using the gadgets it discovered in the source program. 
This process is called gadget assignment. The goal is 
to assign gadgets found during discovery to the vertices 
of arrangements, and see if the assignment is compati- 
ble. After a successful gadget assignment, the output is 
a mapping from gadget arrangement vertices to concrete 
gadgets. It is straightforward to print a ROP payload with 
this mapping. 

Gadget assignments need a schedule, since the gadgets 
must execute in some order. Selecting a valid schedule 
is not always easy because there are data dependencies 
between different gadgets. For instance, if the gadget at
T2 clobbers (overwrites) the Value register in Figure 3, the
gadget at T3 will not receive the correct input. To resolve
such dependencies between gadgets, a gadget assignment 
and corresponding schedule must satisfy these properties: 


Matching Registers Whenever the result of gadget a is 
used as input type to gadget b, then the two registers 
should match, i.e., OutReg(a) = InReg(b, type). 

No Register Clobbering If the output of gadget a is 
used by gadget b, then a’s output register should not 
be clobbered by any gadget scheduled between a and 
b. For example, for the schedule shown in Figure 3, 
the LOADCONSTG operation during T2 should not
clobber the result of the previous LOADCONSTG
that happened during T1.


We say that a gadget assignment and schedule are com- 
patible when the above properties hold, and that a gadget 
arrangement that has a compatible assignment and sched- 
ule is satisfiable. 
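
The two compatibility properties can be expressed as a small check over an assignment and a schedule. The data model below (a hypothetical Gadget record, edges as producer/consumer/input-type triples, and a clobber set that includes the gadget's own output register) is an assumption made for this sketch and is not taken from Q.

```python
from collections import namedtuple

# clobbers: every register the gadget modifies, including its output.
Gadget = namedtuple("Gadget", "name out_reg in_regs clobbers")


def is_compatible(edges, schedule, assignment):
    """Check Matching Registers and No Register Clobbering.

    edges: list of (producer, consumer, input_type) vertex triples.
    schedule: dict mapping each vertex to its time slot.
    assignment: dict mapping each vertex to a Gadget.
    """
    for a, b, input_type in edges:
        ga, gb = assignment[a], assignment[b]
        # Matching Registers: OutReg(a) = InReg(b, type).
        if ga.out_reg != gb.in_regs[input_type]:
            return False
        # No Register Clobbering between the slots of a and b.
        for v, slot in schedule.items():
            if schedule[a] < slot < schedule[b]:
                if ga.out_reg in assignment[v].clobbers:
                    return False
    return True


# The arrangement of Figure 3 with illustrative gadgets.
load_eax = Gadget("pop %eax; ret", "eax", {}, frozenset({"eax"}))
load_ebx = Gadget("pop %ebx; pop %ebp; ret", "ebx", {}, frozenset({"ebx", "ebp"}))
store = Gadget("mov %eax, 0x14(%ebx); ret", None,
               {"Value": "eax", "Addr": "ebx"}, frozenset())
edges = [("value", "store", "Value"), ("addr", "store", "Addr")]
schedule = {"value": 1, "addr": 2, "store": 3}
ok = is_compatible(edges, schedule,
                   {"value": load_eax, "addr": load_ebx, "store": store})
```

If the address loader scheduled at the second slot also modified %eax, the value loaded during the first slot would be lost and the check would fail.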

Although deciding whether a given gadget schedule
and assignment are compatible is straightforward (i.e.,
just ensure the above properties are satisfied), creating a
practical algorithm to search for satisfiable arrangements
is more complicated. The most straightforward approach
is to iterate over all possible arrangements, schedules, and
assignments, but this is simply too inefficient.

Instead, our key observation is that if a gadget arrange- 
ment GA is unsatisfiable, then any GA’ that contains GA 
as a subtree is unsatisfiable as well. Our algorithm at- 
tempts to satisfy iteratively larger subtrees until it fails, or 
has satisfied the entire arrangement. If the algorithm fails 
on a subtree, it aborts the entire arrangement. Since most 
arrangements are unsatisfiable, this saves considerable 
time. (If most arrangements are satisfiable, the search will 
not take very long anyway.) 

Our assignment algorithms are found in Algorithms 
2 and 3. Algorithm 2 is a naive search over a schedule 
for all possible gadget assignments. Algorithm 3 is a 
caching wrapper that caches results and calls Algorithm 2 
on iteratively larger subtrees. It stops as soon as it finds a 
subtree which cannot be satisfied. Q calls Algorithm 3 on 
each possible gadget arrangement until one is satisfiable 
or there are none left. 
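
The subtree-first search with caching can be sketched as below. Arrangements are encoded as hypothetical (gadget_type, children) tuples, and try_assign is a caller-supplied stand-in for the expensive search over schedules and assignments; neither is Q's actual data structure.

```python
def height(tree):
    """Height of an arrangement tree encoded as (gadget_type, [children])."""
    return 1 + max((height(c) for c in tree[1]), default=0)


def subtrees(tree):
    """The tree itself followed by all of its descendants' subtrees."""
    out = [tree]
    for child in tree[1]:
        out.extend(subtrees(child))
    return out


def satisfiable(tree, try_assign, cache):
    """Check subtrees from shortest to tallest, stopping at the first failure.

    `cache` is shared across arrangements, so a subtree that was already
    decided is never searched again.
    """
    for sub in sorted(subtrees(tree), key=height):
        key = repr(sub)
        if key not in cache:
            cache[key] = try_assign(sub)
        if not cache[key]:
            return False
    return cache[repr(tree)]
```

Sharing the cache between calls is what lets an arrangement be rejected without any new search when it contains a subtree that already failed.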

The algorithms make use of several data structures: 


• C : V → {0, 1, ?} is a cache that maps a gadget
  arrangement vertex to one of true, false, or unknown.

• S : V → N represents the current schedule as a
  one-to-one mapping between each vertex and its position
  in the schedule.

• G : V → 𝒢 is the current assignment of each vertex
  to its assigned gadget.


Q can also search for assignments that meet other con- 
straints. For instance, Q can search for assignments that 
would result in a payload smaller than a user-specified 
size. This is useful because ROP payloads are typically 
larger than conventional payloads, and vulnerabilities usu- 
ally limit the number of payload bytes that can be written. 


5 Creating Exploits that Bypass ASLR and W⊕X


In the previous section, we described how to generate 
return oriented payloads. If an attacker can redirect exe- 
cution to the payload in the memory space of the vulnera- 
ble program by creating an exploit, then the computation 
specified by the payload will occur. In this section, we 
explain how Q can automatically create such an exploit 
when given an input exploit that does not bypass ASLR 
and W⊕X.

We call this the exploit hardening problem.
Specifically, in the exploit hardening problem we are given a
program P and an input exploit that triggers a
vulnerability. The input exploit can be an exploit that does not
bypass defenses, or can even be a proof of concept
crashing input. The goal is to output an exploit for P that
bypasses W⊕X and ASLR.


Algorithm 2 Find a satisfying schedule and gadget
assignment for GA

Input: S, G, nodeNum
  V ← S⁻¹(nodeNum)  {Obtain vertex in GA for nodeNum}
  if V = ⊥ then  {Base case to end recursion}
    return true
  end if
  gadgets ← GADGETSOFTYPE(GADGETTYPE(V))
  for all g ∈ gadgets do
    if ISCOMPATIBLE(G, nodeNum, g) then  {Ensure g is compatible with all gadgets before time slot nodeNum}
      if Algorithm 2(S, G[V ← g], nodeNum + 1) then  {Try to schedule later schedule slots}
        return true
      end if
    end if
  end for
  return false  {No gadgets matched}


Algorithm 3 Iteratively try to satisfy larger subtrees of a
GA, caching results over all arrangements.

Input: GA, C
  for all GA′ ∈ SUBTREES(GA) do  {In order from shortest to tallest}
    if C(GA′) = ? then
      C(GA′) ← exists S ∈ SCHEDULES(GA′) such that Algorithm 2(S, EMPTY, 0) = true
    end if
    if C(GA′) = false then  {Stop early if a subtree cannot be satisfied}
      return false
    end if
  end for
  return C(GA)  {Return the final value from the cache}


Intuitively, the input exploit should provide useful in- 
formation about a vulnerability in P. Q uses this infor- 
mation to consider other inputs that follow the execution 
path of the input exploit (i.e., the sequence of conditional 
branches and jumps taken by an execution of the input) 
on P, and attempts to find a new input that uses a return- 
oriented payload instead (Section 4). 


Q does not always succeed (e.g., sometimes it returns 
with no exploit), but we show that it works for real Linux 
and Windows vulnerabilities in Section 7. The fact that 
our system works with even a few real exploits means that 
an attacker can sometimes download an exploit and
automatically harden it to one that works even when W⊕X
and ASLR are enabled. 
5.1 Background: Generating Formulas from a Concrete Run


There can be a very large number of inputs along the 
vulnerable path. Rather than trying to reason about each 
input individually, we build a logical constraint formula 
representing all inputs that follow the vulnerable path. 
Such constraint formulas have been used in many research 
areas, including automatic test case generation, automatic 
signature creation, and others [5, 7, 23, 40]. 


Generating constraint formulas from an input involves 
two steps. First, we record at the binary level the concrete 
execution of the vulnerable program running on the input 
exploit; we call such a recording an execution trace. Our 
recording tool incorporates dynamic taint analysis [11, 33, 
40] to keep track of which instructions deal with tainted 
(or input-derived) data. Our tool uses this information 
to 1) record only the instructions that access or modify 
tainted data, for performance reasons; and 2) halt the 
recording once control-hijacking takes place (i.e., when 
the instruction pointer becomes tainted). 


After recording the concrete execution, Q symbolically 
executes [7, 40] the target program, following the same 
path as in the recording. Symbolic execution is similar 
to normal execution, except each input byte is replaced
with a symbol (e.g., s_i for input byte i). Any computation
involving a symbolic input is replaced with a symbolic
expression. Computations not involving a symbolic input
are computed as normal (i.e., using the processor). Any
constraints on the inputs to ensure that execution would
be guided down the same path as the execution trace are
stored in the constraint formula Π.


Before performing any analysis, we use the Binary 
Analysis Platform [3] to raise binary code into an inter- 
mediate language that is better suited to program analysis. 
This frees our analysis from needing to understand the 
semantics of each assembly instruction. 


5.2 Exploit Constraint Generation 


The constraint formula Π describes all inputs that follow
the vulnerable path. In this paper, we are only interested
in inputs that hijack control to our desired computation.
We build two constraints, α (control flow) and Σ
(computation), that exclude any inputs that do not work as
exploits. α maps to true only if a program’s control flow
has been diverted, and Σ maps to true only if the payload
for some desired computation is in the exploit.


5.2.1 Assuring Control Flow Hijacking





α takes the form jumpExp = targetExp, where
jumpExp is the symbolic expression representing the 
target of the jump that tainted the instruction pointer, and 
targetExp depends on the type of exploit. 

The value of jumpExp can be obtained from the ex- 
ecution trace. Since the trace halts when the program 
jumps to a user-derived address, jumpExp is simply the 
symbolic expression for the target of this jump. Consider 
the following program. 











1  x := 2 * get_input()
2  goto x











Our trace system would halt the above program at Line 
2, because the program jumps to a user-derived address. 
The symbolic jump expression from symbolic execution 
of the program is 2 * s_1. α for this program would be
2 * s_1 = targetExp.

For a typical stack exploit, targetExp = 
&(shellcode), where & means the address of. With 
a return oriented payload, this would usually be 
targetExp = &(ret). This assumes that the ROP 
payload is located in memory at the address in %esp. 
If not, Q can use a pivot, which its ROP system 
can automatically find. For instance, targetExp = 
&(xchg %eax, %esp; ret) would transfer control 
to the ROP payload pointed to by %eax. 








5.2.2 Assuring Computation 


Computation constraints ensure that the computation payload
is available in memory at the proper address at
the time of exploitation. For instance, computation constraints
for a strcpy buffer overflow would be unsatisfiable
for a payload containing a null byte, since this would
result in only part of the payload being copied.

Computation constraints take the form Σ =
(mem[payloadBase] = payload[0] ∧ ... ∧
mem[payloadBase + n] = payload[n]), where
payloadBase denotes the starting address of the payload
in memory, and payload denotes the bytes in the
payload (e.g., the ROP payload from Section 4). When
using a basic ROP payload, payloadBase will be set
to %esp, since that is where a ret will start executing.
When using a pivot, this value will depend on the pivot in
the natural way.
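
The strcpy example can be made concrete with a small model. This is a sketch under simplifying assumptions: the destination buffer is assumed to start exactly at payloadBase, the attacker input is assumed to be the payload itself, and the address 0x1000 and all function names are hypothetical.

```python
def strcpy_image(src):
    """Bytes strcpy writes to the destination: everything up to and
    including the first null byte (a terminator is added if none exists)."""
    end = src.find(b"\x00")
    return src + b"\x00" if end == -1 else src[: end + 1]


def memory_after_overflow(buffer_base, attacker_input):
    """Model memory as a dict from address to byte after the copy."""
    copied = strcpy_image(attacker_input)
    return {buffer_base + i: b for i, b in enumerate(copied)}


def computation_constraint(mem, payload_base, payload):
    """Σ: mem[payloadBase + i] == payload[i] for every payload byte."""
    return all(mem.get(payload_base + i) == b for i, b in enumerate(payload))


BASE = 0x1000  # hypothetical payload address
good_payload = bytes([0x41, 0x42, 0x43, 0x44])
bad_payload = bytes([0x41, 0x00, 0x43, 0x44])   # contains a null byte

good_ok = computation_constraint(
    memory_after_overflow(BASE, good_payload), BASE, good_payload)
bad_ok = computation_constraint(
    memory_after_overflow(BASE, bad_payload), BASE, bad_payload)
```

The payload with an embedded null byte is truncated by the copy, so its constraint cannot hold, while the null-free payload satisfies it.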


5.2.3 Finding an Exploit 


By combining these constraints with Π, which only holds
for inputs following the vulnerable path, we can create a
constraint formula that only describes exploits along the
vulnerable path:


Π ∧ α ∧ Σ. (2)


Any assignment to the initial program state that satisfies
this constraint formula is an exploit for the program semantics
recorded in the trace. We use an off-the-shelf
decision procedure, STP [19], to solve the formulas.
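
To make the role of the decision procedure concrete, the sketch below searches for an input that satisfies all three constraints at once. It is not how STP works: exhaustive enumeration stands in for the solver and only scales to a few bytes. The two-byte input layout, the 8-bit wraparound arithmetic, and the constants TARGET and PAYLOAD_BYTE are hypothetical choices made for illustration.

```python
from itertools import product


def solve(constraints, n_bytes):
    """Return the first assignment of n_bytes input bytes that satisfies
    every constraint, or None if no assignment exists."""
    for candidate in product(range(256), repeat=n_bytes):
        if all(c(candidate) for c in constraints):
            return bytes(candidate)
    return None


TARGET = 0x84         # hypothetical value that the jump expression must take
PAYLOAD_BYTE = 0x90   # hypothetical one-byte payload


def pi(s):
    return s[0] != 0x00                  # path constraint: stay on the trace


def alpha(s):
    return (2 * s[1]) % 256 == TARGET    # control flow: jumpExp = targetExp


def sigma(s):
    return s[0] == PAYLOAD_BYTE          # computation: payload byte in place


exploit = solve([pi, alpha, sigma], n_bytes=2)
unsat = solve([lambda s: (2 * s[0]) % 256 == 0x85], n_bytes=1)
```

The second query has no solution because doubling a byte can never produce an odd value, which mirrors the case where Q reports that it cannot produce a hardened exploit.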


6 Implementation 


The ROP component (Section 4) of Q is built on top 
of the BAP framework [3]. The implementation for the 
gadget discovery, arrangement, and assignment phases 
comprises 4,585 lines of ML code. The ROP system uses 
the STP [19] decision procedure to determine the validity 
of generated weakest preconditions. 

Q’s exploit hardening component (Section 5) itself con- 
sists of a tracing (recording) component and an analysis 
component. We implemented the tracing tool using the 
Pin [28] framework, which allows analysis code to in- 
strument a running process and take measurements in 
between instruction execution. Our tool is optimized to 
only record instructions that are considered to be user- 
derived; the user can mark any input coming from files, 
network sockets, environment variables, or program ar- 
guments as being user-derived, and can record processes 
that fork (e.g., network daemons). The tracing component 
is written in C++, and includes 2,102 lines of code written 
for this project. 

The analysis portion of the hardening system is imple- 
mented in the BAP [3] framework. It consists of com- 
ponents that 1) lift the recorded assembly instructions 
into the BAP intermediate language, 2) symbolically execute
the trace, obtaining the constraint formula Π, and 3)
compute the constraints α and Σ. Our analysis tool then
uses STP [19] to find a satisfying answer to the resulting 
constraint formula, and uses the result to build an exploit. 
It also fully understands Windows SEH (structured excep- 
tion handler) exploits, in which the exception handler is 
overwritten. The analysis implementation is written in 
ML, and includes 1,090 new lines of code for this project. 

All components of Q are fully capable of reasoning 
about Windows and Linux binaries. 


7 Evaluation 


We evaluate Q’s capabilities to produce ROP payloads 
and harden exploits in this section. 


7.1 Return Oriented Programming


Applicability We would like to know how often Q can 
build ROP payloads when given a random source program 
P. To evaluate this, we ran Q on all of the 1,298 ELF 
programs in /usr/bin on an author’s Ubuntu 9.10 desktop
machine and tried to generate various return oriented
payloads. We then discarded the results for the 66 pro- 
grams that were marked as ASLR-compatible (PIE). We 
used Linux programs for our corpus because it is easier to 
gather a typical set of Linux programs than for Windows. 
For each program P, we consider if Q can create a ROP 
payload to: 


Call functions also called by P External functions 
called by P have an entry in the program’s Proce- 
dure Linkage Table (PLT). Q calls the PLT entries 
directly; if the external function has not been loaded, 
the dynamic loader will be invoked to load it before 
transferring control to the called function. 

Call external functions in libc Calling external functions
that do not have a PLT entry is more complicated.
For this, we build on a technique for calculating
the address of functions in libc even when
libc is randomized [39]. This involves more computation
than the above case, and so is more likely to
be unsatisfiable.

Write to memory We consider a payload that writes 
four bytes to an arbitrary address. 


For each of the target programs above, we measure 
whether our system can create a payload for it using in- 
struction sequences taken from each source program in 
our corpus. We consider an attempt successful if our sys- 
tem successfully builds a payload. Note that the attacker 
must still find a way to load the payload into memory and 
redirect control to it for it to be used as an exploit. 

The results of this experiment are shown in Figure 4. 
The probability of success for the above payload types is 
plotted as a function of source program size. The Call/Store
line represents the Call functions also called by P
and Write to memory cases above, since the results are 
visually indistinguishable. The Call (libc) line represents 
Call external functions in libc. 

The results support the claim that ROP is more difficult 
when there is less binary code. Even so, Q is able to 
call linked functions and store arbitrary memory bytes 
to arbitrary locations in 80% of binaries that are at least 
20KB. Q can also call any function in libc in 80% of
binaries 100KB or larger.⁶


⁶ The fact that Q generated payloads for so many binaries was disturbing
to the author whose machine the programs came from.

Figure 4: The probability that Q can generate various payload types, shown as a function of source file size. As expected,
the probability grows with file size. The percentage is calculated over non-position-independent executables. Q can call
linked functions in 80% of programs that are 20KB or larger, and can call any function in linked shared libraries in 80%
of programs that are at least 100KB in size.


Efficiency While we found that semantic gadget discov- 
ery techniques are useful for finding gadgets, they are not 
very fast. In our implementation, we found that adding a 
concrete randomized testing stage increased Q’s perfor- 
mance. To measure this, we collected a random sample of 
32 programs from our /usr/bin dataset and ran gadget 
discovery. For each program, we ran Q twice, once with 
randomized testing enabled, and once disabled. Figure 5 
shows a boxplot of the elapsed wall times when running 
with 16 active threads. (The time difference would be 
greater with fewer threads, but the experiment would take 
a very long time to complete for the non-randomized 
cases.) As expected, Q runs faster when randomized 
testing is enabled. 
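
One way such a randomized testing stage can save time is by acting as a cheap filter in front of the semantic check. The sketch below illustrates that idea only; it assumes, without confirmation from the text, that candidates failing a concrete test are discarded before the decision procedure is invoked. The register model, the trial count, and all function names are hypothetical.

```python
import random

MASK = 0xFFFFFFFF   # model 32-bit registers


def passes_random_tests(candidate, spec, trials=32, seed=0):
    """Run candidate and spec on random register states. One mismatch
    proves the candidate does not implement the gadget type; agreement on
    every trial proves nothing, so survivors still need the semantic check."""
    rng = random.Random(seed)
    for _ in range(trials):
        state = {"eax": rng.getrandbits(32), "ebx": rng.getrandbits(32)}
        if candidate(dict(state)) != spec(dict(state)):
            return False
    return True


def spec_add(state):
    """Specification of an arithmetic gadget: eax := eax + ebx."""
    state["eax"] = (state["eax"] + state["ebx"]) & MASK
    return state


def seq_add(state):
    """Models the sequence `add %ebx, %eax; ret`."""
    state["eax"] = (state["eax"] + state["ebx"]) & MASK
    return state


def seq_xor(state):
    """Models the sequence `xor %ebx, %eax; ret`."""
    state["eax"] ^= state["ebx"]
    return state


keep_add = passes_random_tests(seq_add, spec_add)
keep_xor = passes_random_tests(seq_xor, spec_add)
```

Here the xor sequence is rejected by concrete execution alone, so only the add sequence would be passed on to the more expensive validity query.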


[Figure 5 plot omitted. Y-axis: Seconds; x-axis: W/ Rand., WO/ Rand.]

Figure 5: Boxplots of the time it takes to discover gadgets
from a program for a random sample of 32 programs,
when randomized testing is enabled and disabled.


Sizes Our results from Figure 4 show that larger pro- 
grams are generally easier to build return oriented pro- 
grams from. Figure 6 shows the sizes of the programs in 
our experiments, and compares them to the binaries used 
in prior research: libc [38], the iPhone library [16], and
the Windows kernel [21]. We note that these binaries are
significantly larger than most /usr/bin programs.


Gadget Frequency Figure 7 shows the frequency of 
various types of gadgets in programs larger than 20KB.⁷
It offers some insight on why ROP on small binaries 
is difficult. The most useful gadget types, like STORE- 
MEMG and LOADMEMG, are not very common. Instead, 
combined gadgets like ARITHMETICSTOREG are more 
prevalent. This is not surprising, given that compilers 
try to combine operations to optimize efficiency. These 
results are what inspired Q’s gadget arrangement system, 
which can cope with missing gadget types. 
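
The pre-processing step described in the footnote to this paragraph (a gadget is redundant to another if both have the same type and input registers and the first clobbers a superset of the registers the second clobbers) can be sketched as follows. The Gadget record, its field names, and the example addresses are hypothetical; only the redundancy rule itself comes from the text.

```python
from collections import namedtuple

# Hypothetical gadget record; clobbers is a frozenset of register names.
Gadget = namedtuple("Gadget", "type inputs clobbers address")


def is_redundant(g1, g2):
    """g1 is redundant to g2: same type and inputs, and g1 clobbers a
    superset of what g2 clobbers."""
    return (g1.type == g2.type and g1.inputs == g2.inputs
            and g1.clobbers >= g2.clobbers)


def prune(gadgets):
    """Keep only gadgets that are not redundant to another kept gadget."""
    kept = []
    for g in gadgets:
        if any(is_redundant(g, k) for k in kept):
            continue                                   # g adds nothing new
        kept = [k for k in kept if not is_redundant(k, g)]
        kept.append(g)
    return kept


a = Gadget("MOVEREGG", ("eax",), frozenset({"ecx"}), 0x1000)
b = Gadget("MOVEREGG", ("eax",), frozenset({"ecx", "edx"}), 0x2000)
c = Gadget("MOVEREGG", ("eax",), frozenset(), 0x3000)
d = Gadget("NOOPG", (), frozenset(), 0x4000)

survivors = prune([b, a, d, c])
```

Among the three MOVEREGG candidates only the one that clobbers no registers survives, which is the same effect that leaves a single NOOPG entry per program.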


7.2 Exploit Hardening 


To evaluate exploit hardening, we tested it with a variety
of publicly available exploits for Linux and Windows.
We consider each experiment a success if Q can harden
a public exploit for real software by producing working
exploits that bypass W⊕X and ASLR. We do not expect
that our system will always produce a hardened exploit.

We compiled each vulnerable program from source
when possible, disabled all defenses (including ASLR and
W⊕X), and then verified that the exploit at least crashed
the vulnerable program. We then ran the exploit through
the exploit hardening component of Q, and created two
payloads that bypass W⊕X and ASLR. These payloads
1) call a linked function and 2) call system("w")
on Linux or WinExec("calc.exe") on Windows.
We tested these two exploits with ASLR and W⊕X enabled.
The results of these experiments are shown in
Table 4.


⁷ These results are after a pre-processing step that throws away redundant
gadgets. A gadget g1 is redundant to g2 if they both have the
same type and input registers, and g1 clobbers a superset of the registers
that g2 clobbers. This is why there is only one NOOPG gadget type
listed for all programs, even though every ret instruction can be used
as a NOOPG.

Figure 6: The empirical cumulative distribution function of the file sizes in /usr/bin. In this graph, a point at (x, y)
means that 100y percent of the files in /usr/bin have a size less than or equal to x bytes. We also show the sizes of
the iPhone libsystem library [16], libc [38] and the Windows kernel [21], which prior work has targeted. libc and the
iPhone library are both larger than 95% of the programs in our corpus, while the Windows kernel is larger than 99%.
We plot dotted lines at 20 and 100KB for reference; these are the sizes at which Q works well, as shown in Figure 4.

Figure 7: The frequency of various types of gadgets in /usr/bin programs larger than 20KB.

We found that our system was able to harden exploits 
for several large, real programs. In general, our sys- 
tem performed as expected: it only output exploits that 
worked, and in some cases reported it could not produce 
a hardened exploit. 


8 Discussion 


Ret-less ROP When we designed Q, no one had shown 
that ROP was possible without using ret-like instruc- 
tions. Since then, Checkoway, et al. have shown [8] that 
it is possible to create a Turing-complete gadget set that 
does not use ret instructions. Their gadgets have control 
flow preservation preconditions. For example, the gadget 
pop %eax; jmp *%edx only preserves control flow 
if %edx is preset to the next gadget address. Q does not 
make any assumptions about the preconditions for a gad- 
get when considering control flow preservation, which 
prevents it from finding gadgets of the above form. We
leave it as future work to determine whether it is possible 
to automatically construct ROP exploit payloads that do 
not use ret instructions. 


Side effects Q conservatively handles side effects by
discarding any instruction sequence that might cause
the program to crash, such as a pointer dereference.
As one example, pushl %eax; popl %ebx; ret
will move the value in %eax to %ebx. Since a
MOVEREGG gadget does not intentionally use memory,
however, Q would discard this gadget. We plan to add a
more advanced memory analysis that can statically detect
when a memory access will be safe, which will allow Q
to use more gadgets.


Turing completeness Q’s language for describing target
programs, QooL, is not Turing-complete. Our early
tests revealed that the ARITHMETICG gadgets needed
for conditional jumps, such as equality tests, were often
unavailable in small programs. As a result, we focused on
the gadgets needed for practical exploitation, rather than
striving for Turing-completeness.


Program                  | Reference     | Tracing | Analysis | Call Linked | Call System | OS  | SEH
Free CD to MP3 Converter | OSVDB-69116   | 89s     | 41s      | Yes         | Yes         | Win | No
FatPlayer                | CVE-2009-4962 | 90s     | 43s      | Yes         | Yes         | Win | Yes
A-PDF Converter          | OSVDB-67241   | 238s    | 140s     | Yes         | Yes         | Win | No
A-PDF Converter          | OSVDB-68132   | 215s    | 142s     | Yes         | Yes         | Win | Yes
MP3 CD Converter Pro     | OSVDB-69951   | 103s    | 55s      | Yes         | Yes         | Win | Yes
rsync                    | CVE-2004-2093 | 60s     | 5s       | Yes         | Yes         | Lin | NA
opendchub                | CVE-2010-1147 | 195s    | 30s      | Yes         | No          | Lin | NA
gv                       | CVE-2004-1717 | 113s    | 124s     | Yes         | Yes         | Lin | NA
proftpd                  | CVE-2006-6563 | 30s     | 10s      | Yes         | Yes         | Lin | NA

Table 4: A list of public exploits hardened by Q. For each exploit, we record how long the trace and analysis components
took to run, and report if Q produced hardened exploits that call 1) a linked function, and 2) system or WinExec.


9 Related Work 


Return Oriented Programming Krahmer was the first 
to propose using borrowed code chunks [25] from the pro- 
gram text to perform meaningful actions. Later, Shacham 
showed in his seminal paper [41] on ROP that a set of 
Turing complete gadgets can be created using the pro- 
gram text of libc. Shacham developed an algorithm that 
put instruction sequences into trie form to help a human 
manually select useful instruction sequences. 

Since then, several researchers have investigated how 
to more fully automate ROP [16, 21, 38]. Dullien and 
Kornau [16, 24] automatically found gadgets in mobile 
support libraries (on the order of 1,000KB), and Roemer [38]
demonstrated it was possible to automatically discover
gadgets in libc (1,300KB). Hund [21] used gadgets
from ntoskrnl.exe (3,700KB) and win32k.sys
(2,200KB). In contrast, our techniques often only have 
20KB of binary code to create gadgets from, because 
generally only small code modules are unrandomized in 
user-mode exploitation contexts. Previous work focusing 
on such small code bases was mostly or entirely manual; 
for instance, Checkoway, et al. manually crafted a Turing 
complete set of gadgets from 16KB of Z80 BIOS [9]. 


Automatic Exploitation Our exploit hardening system 
(Section 5) is related to existing automatic exploitation 
research [2, 5, 20, 26]. In automatic exploitation, the goal 
is to automatically find an exploit for a bug when given 
some starting information (such as a patch [5], guiding 
input [20, 26], or program precondition [2]). Some auto- 
matic exploitation research focuses on creating an input 
that triggers a particular vulnerability [5, 18, 26], but does 
not focus on control flow exploitation, which is one of the
focuses of our work. Our techniques can use the inputs
produced by these projects as an input exploit, and harden
them so that they bypass W⊕X and ASLR.

We are only aware of one other project that considers 
creating an exploit given another exploit [20]; in this case 
the input exploit only causes a crash. Our work uses 
symbolic execution to reason about other inputs that take 
the same path as the input exploit. In contrast, Heelan [20] 
tracks data dependencies between the desired payload 
bytes and the input bytes, but does not ensure that control 
flow will stay the same and preserve the observed data 
dependencies. As a result, his approach is heuristic in 
nature, but is likely to be faster than ours. 


Related Attacks Other researchers have previously 
used simple ROP gadgets in the .text section of bi- 
naries to calculate the address of functions in libc [39]. 
Unfortunately, this is insufficient to make arbitrary func- 
tion calls when ASLR is enabled, because many functions 
require pointers to data. Recall from Section 2 that all 
modern operating systems except for Mac OS X random- 
ize the stack and heap, thus making it difficult for an 
attacker to introduce argument data and know a pointer to 
its address. QooL (Section 4.3.1) allows target programs 
to write payloads to known addresses, typically in the 
.data segment, which eliminates this problem.

A recent attack developed concurrently with Q [27] can 
also write data to known constant memory locations, and 
thus can also make arbitrary function calls in the W⊕X
and ASLR setting. This attack uses repeated strcpy
return-to-libc calls to copy data from the binary itself to 
a specified location. In contrast, our attack uses ROP 
gadgets discovered by Q. 

There are specialized attacks against W⊕X and ASLR
that are only applicable inside of a browser, such as JIT
spraying [4, 43]. The downside is that they are not applicable
to all programs.


Related Defenses The most natural way of defeating 
ROP is to randomize all executable code. For instance, we 
are not able to deterministically attack position indepen- 
dent executables in Linux, because we do not know where 
any instruction sequences will be in memory. Operating 
systems have chosen not to randomize all code in the past 
because of performance and compatibility issues; these 
reasons should now be reevaluated considering the new 
evidence that allowing even small amounts of unrandom- 
ized code can enable an attacker to use ROP payloads. 

Other defenses against ROP exist. One defense is to 
dynamically instrument running programs and look for 
sequences of instructions that contain returns with few 
instructions spaced between [10, 12]. The assumption 
is that normal code will generally execute non-trivial 
amounts of code in between ret instructions, whereas 
ROP code will not. 
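
The intuition behind this class of defense can be sketched as a simple stream heuristic. The thresholds max_gap and min_run and the function name are hypothetical and are not taken from the cited systems [10, 12]; the sketch only shows the shape of the check.

```python
def looks_like_rop(mnemonics, max_gap=5, min_run=3):
    """Return True if min_run consecutive returns are each preceded by at
    most max_gap non-return instructions."""
    run = 0   # consecutive short-gap returns seen so far
    gap = 0   # instructions executed since the previous return
    for m in mnemonics:
        if m != "ret":
            gap += 1
            continue
        run = run + 1 if gap <= max_gap else 0
        if run >= min_run:
            return True
        gap = 0
    return False


rop_like = ["pop", "ret", "pop", "pop", "ret", "add", "ret"]
normal = ["mov"] * 20 + ["ret"] + ["mov"] * 20 + ["ret"]

flag_rop = looks_like_rop(rop_like)
flag_normal = looks_like_rop(normal)
```

The short gadget chain is flagged, while code that executes many instructions between returns is not.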

A similar defense is to ensure that the call chain of a 
program respects the stack semantics, i.e., that a ret will
only transfer control to a program location that previously 
executed a call instruction. Such techniques [13, 37] 
are implemented using a shadow stack that is maintained 
outside of normal memory space. Both of these defenses 
make the assumption that ROP must be performed using 
the ret instruction. 
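
The shadow stack check can be sketched as follows. This is a minimal model and not the mechanism of the cited systems [13, 37]: the class, its method names, and the example addresses are hypothetical, and a real implementation would hook call and ret through instrumentation or hardware support.

```python
class ShadowStackViolation(Exception):
    """Raised when a return does not match the most recent call."""


class ShadowStack:
    def __init__(self):
        self._stack = []   # return addresses, kept out of attacker reach

    def on_call(self, return_address):
        """Record the address that the matching ret must return to."""
        self._stack.append(return_address)

    def on_ret(self, target):
        """Check that a ret goes back to the recorded return address."""
        if not self._stack or self._stack.pop() != target:
            raise ShadowStackViolation(hex(target))


monitor = ShadowStack()
monitor.on_call(0x8048010)
monitor.on_ret(0x8048010)       # legitimate return: accepted

try:
    monitor.on_ret(0x8049ABC)   # return with no matching call: ROP-like
    detected = False
except ShadowStackViolation:
    detected = True
```

A return-oriented payload issues returns that were never paired with a call, so the first gadget transition already fails the check.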

Unfortunately for defenders, researchers [8] have re- 
cently shown that it is possible to perform ROP on x86 
without using ret instructions at all, which is enough to 
bypass these schemes without modifications. However, 
the proof of concept techniques required access to large 
libraries, which are randomized in modern operating sys- 
tems. It remains an open question whether such attacks 
are possible in modern user-mode exploitation contexts, 
when little unrandomized code is available. 


10 Conclusion 


We developed return oriented programming (ROP) tech- 
niques that work on small, unrandomized code bases as 
found in modern systems. We demonstrated that it is pos- 
sible to synthesize ROP payloads for 80% of programs 
larger than 20KB, implying that even a small amount of 
unrandomized code is harmful. We also built an end- 
to-end exploit hardening system, Q, that reads as input 
an exploit that does not bypass defenses, and automatically
hardens it to one that bypasses ASLR and W⊕X.
Our techniques and experiments demonstrate that current
ASLR and W⊕X implementations, which allow small
amounts of code to be unrandomized, continue to allow 
ROP attacks. Operating system designers should weigh 
the dangers of such attacks against the performance and 
compatibility penalties imposed by randomizing all code 
by default. 


Abstract

The Trusted Platform Module (TPM) is commonly 
thought of as hardware that can increase platform secu- 
rity. However, it can also be used for malicious pur- 
poses. The TPM, along with other hardware, can imple- 
ment a cloaked computation, whose memory state cannot 
be observed by any other software, including the operat- 
ing system and hypervisor. We show that malware can 
use cloaked computations to hide essential secrets (like 
the target of an attack) from a malware analyst. 

We describe and implement a protocol that establishes 
an encryption key under control of the TPM that can only 
be used by a specific infection program. An infected host 
then proves the legitimacy of this key to a remote mal- 
ware distribution platform, and receives and executes an 
encrypted payload in a way that prevents software visibil- 
ity of the decrypted payload. We detail how malware can 
benefit from cloaked computations and discuss defenses 
against our protocol. Hardening legitimate uses of the 
TPM against attack improves the resilience of our malware,
creating a Catch-22 for secure computing technology.


1 Introduction


The Trusted Platform Module (TPM) has become a com- 
mon hardware feature, with 350 million deployed com- 
puters that have TPM hardware [14]. The purpose of TPM 
hardware, and the software that supports it, is to increase 
the security of computer systems. However, this paper ex- 
amines the question of how a malware author can use the 
TPM to build better malware, specifically malware that 
cannot be analyzed by white hat researchers. 

Trusted computing technology [42] adds computer 
hardware to provide security primitives independent from 
other system functionality. The hardware provides cer- 
tain low-level security guarantees directly. For example, 
it guarantees that only it can read and write certain data. 
Trusted software uses these low-level, hardware-enforced 
properties to build powerful guarantees for programmers. 

The TPM, as developed by the Trusted Computing 
Group (TCG), is one of the more popular implementations 
of trusted computing technology. The TPM has seen sig- 
nificant use in industry and government; the TPM is used 
in Microsoft’s popular BitLocker drive encryption software
[7] and the United States Department of Defense has
required the TPM as a solution for securing data on lap- 
tops [4]. TPMs are regularly included on desktop, laptop, 
and server-class computers from a number of manufac- 
turers. The wide dissemination of TPM functionality is 
potentially a boon for computer security, but this paper 
examines the potential of the TPM for malware authors (a 
first to our knowledge). 


A malware writer can use the TPM for implementing 
cloaked computations which, combined with a protocol 
described in this paper, impede malware analysis. The 
TPM is used with “late launch” processor mechanisms 
(Intel’s Trusted Execution Technology [12, 8], abbrevi- 
ated TXT, and AMD’s Secure Startup mechanism [10]) 
that ensure uninterrupted execution of secure binaries. 
Late launch is a hardware-enforced secure environment 
where code runs without any other concurrently executing 
software, including the operating system. We demonstrate 
a protocol where a malware author uses cloaked com- 
putations to completely prevent certain malware func- 
tions from being analyzed and understood by any cur- 
rently available methods. TPM functionality ensures that 
a cloaked program will remain encrypted until it is run- 
ning directly on hardware. Assuming certificates for hard- 
ware TPMs identify these TPMs as hardware and cannot 
be forged, our malware will refuse to execute in a virtual- 
ized environment. 


Timely and accurate analysis is critical to the ability 
to stop widespread effects of malware. Honeypots are 
constantly collecting malware and researchers use cre- 
ative combinations of static analysis, dynamic emulation 
and virtualization to reverse engineer malware behavior 
[47, 30, 19, 24, 35, 36]. This reverse engineering is often 
crucial to defeating the malware. For example, once the 
domain name generation algorithm used for propagating 
the Conficker worm was determined, the Conficker ca- 
bal blocked the registration of those DNS names [45, 43], 
thereby defeating the worm. 


While the idea of using the TPM to cloak malware com- 
putation is conceptually straightforward, existing TPM 
protocols do not suffice and must be adapted to the task 
of malware distribution. We clarify the capabilities of
and countermeasures for this threat. Cloaking does not 
make malware all-powerful, and engineering malware to 
take advantage of a cloaked environment is a design chal- 
lenge. A cloaked computation runs without OS support, 
so it cannot make a system call or easily use devices like 
a NIC for network communication. This paper also dis- 
cusses best practices for TPM-enabled systems that can 
prevent the class of attacks we present. 

This paper makes the following contributions. 

• It specifies a protocol that runs on current TPM implementations
that allows a malware developer to execute
code in an environment that is guaranteed to be
not externally observable, e.g., by a malware analyst.
Our protocol adapts TPM-based remote attestation
for use by the malware distribution platform.

• It presents the model of cloaked execution and measures
the implementation of a malware distribution
protocol that uses the TPM to cloak its computation.

• It provides several real-world use cases for TPM-based
malware cloaking, and describes how to adapt
malware to use TPM cloaking for those cases. These
include: worm command and control, selective data
exfiltration, and a DDoS timebomb.

• It discusses various defenses against our attacks and
their tradeoffs with TPM security and usability.


Organization In Section 2 we describe our threat model 
and different attack scenarios for TPM cloaked malware. 
Then in Section 3 we give TPM background information. 
We then describe and analyze a general TPM cloaked mal- 
ware attack in Section 4 and follow with a description of 
a prototype implementation in Section 5. 

We then turn to discussing future defenses against such 
attacks in Section 6; describe related work in Section 7 
and finally conclude in Section 8. 


2 Threat Model and Attack Scenarios 


We begin by describing our threat model for an attacker 
that wishes to use the TPM for cloaked computations. 
Then we describe multiple attack scenarios that can lever- 
age TPM cloaked computations. 


2.1 Threat model and goals 


We consider an attacker who wishes to infect machines 
with malware. His goal is to make a portion of this mal- 
ware unobservable to any analyst (e.g., white-hat security 
researcher, or IT professional) except for its input and out- 
put behavior. 

We assume an attacker will have the following capabilities
on the compromised machine.

• Kernel-level compromise. We assume our attack
has full access to the OS address space. Late launch
computation is privileged and can only be started by
code that runs at the OS privilege level. Exploits
that result in kernel-level privileges for commodity
OSes are common enough to be a significant concern.
For example, in September and October 2010,
there were 13 remote code execution vulnerabilities
and 2 privilege escalation vulnerabilities that could
provide a kernel-level exploit for Microsoft’s Windows
7 [13]. Kernel-level exploits for Linux are reported
more rarely, but do exist, e.g., the recent Xorg
memory management vulnerability [54]. There are
many examples of malware using kernel vulnerabilities
[34, 3].

• Authorization for TPM capabilities. We further assume
our attack can authorize the TPM commands
in our protocol. TPM commands are authorized using
AuthData, which are 160-bit secrets that will
be described further in Section 3. The difficulty
of obtaining AuthData depends on how TPMs are
used in practice. To our knowledge, the TCG does
not provide concrete practices for protecting AuthData.
Most TPM commands do not require AuthData
to be sent on the wire, even in encrypted form.
However, knowing AuthData is necessary for certain
common TPM operations like using TPM-controlled
encryption keys. We discuss acquiring the AuthData
needed by the attack in Sections 3.6 and 4.


An analyst will see all non-blackbox behavior of the 
attacker’s cloaked computation. In our model, the analyst 
is allowed full access to systems that run our malware. 
We assume that all network traffic is visible, and that the 
analyst will attempt to exploit any attack protocol weak- 
nesses. In particular, an analyst might run a honeypot that 
is intended to be infected so that he can observe and ana- 
lyze the malware. A honeypot may use a virtual machine 
(including those that use hardware support for virtualiza- 
tion like VMWare Workstation and KVM [33]), and may 
include any combination of emulated and real hardware, 
including software-based TPM emulators [50] and VM 
interfaces to hardware TPMs like that of VTPMs [17]. 


We assume the analyst is neither able to mount phys- 
ical attacks on the TPM nor is able to compromise the 
TPM public key infrastructure. (We revisit these as- 
sumptions when discussing possible defenses in Sec- 
tion 6.) While there are known attacks against Intel’s 
late launch environment [55] and physical attacks against 
the TPM [51, 32], manufacturers are working to eliminate 
such attacks. Manufacturers have significant incentive to 
defeat these attacks because they compromise what is
currently the TPM’s most commercially important
guarantee: preventing data leakage from laptop theft.


Our attack may be detectable because it increases TPM 
use. Nonetheless, frequent TPM use might be the norm 
for some systems, or users and monitoring tools may sim- 
ply be unaware that increased TPM use is a concern. 


A cloaked computation is limited to a computational 
kernel. It cannot access OS services or make a system 
call. Any functional malware must have extensive support
code beyond the cloaked computation. The support
code performs tasks like communication over the network 
or access to files. The attacker must design malware to 
split functionality into cloaked and observable pieces. Ar- 
guments can be passed to the computational kernel via 
memory, and may be encrypted or signed off-platform for 
privacy or integrity. 


2.2 Attack Scenarios 


We now describe various attack scenarios that leverage 
TPM cloaking. 


2.2.1 Worm command and control


We consider a modification of the Conficker B worm. The 
worm has an infection stage, where a host is exploited and 
downloads command and control code. Then the infection 
code runs a rendezvous protocol to download and execute 
signed binary updates. Engineers halted the propagation 
of Conficker B by reverse engineering the rendezvous pro- 
tocol and preventing the registration of domain names that 
Conficker was going to generate. 

Defeating Conficker requires learning in advance the 
rendezvous domain names it will generate. The sequence 
of domain names can be determined in two ways: first,
by directly analyzing the domain name generation
implementation, or second, by running the algorithm with inputs
that will generate future domain names. Cloaked computation
prevents the static analysis and dynamic emulation
required to reverse engineer binary code, eliminating the
first option of analyzing the implementation.

Conficker uses as input to its domain name generation 
algorithm the current day (in UTC). It establishes the cur- 
rent day by fetching data from a variety of web sites. 
White hat researchers ran Conficker with fake replies to
these HTTP requests, tricking the worm into believing it was
executing in the future.

However, malware can obtain timestamps securely at 
day-level granularity. Package repositories for common 
Linux distributions provide descriptions of repository 
contents that are signed, include the date, and are updated 
daily. (See
http://us.archive.ubuntu.com/ubuntu/dists/lucid-updates/Release
for Ubuntu Linux, which has an accompanying “.gpg”
signature file.) This data is mirrored at many locations
worldwide and is critical for the integrity of package
distribution,¹ so taking it offline or forging timestamps
would be both difficult and a security risk.

Conficker is not alone in its use of domain name gener- 
ation for rendezvous points. The Mebroot rootkit [31] and 
Kraken botnet [5] both use similar techniques to contact 
their command and control servers. 




¹ Although individual packages are signed, without signed release
metadata a user may not know whether there is a pending update for
a package.




Using cloaked computations for malware command 
and control does not ipso facto make malware more dan- 
gerous. Cloaked computations must be used as part of a 
careful protocol in order to be effective. 


2.2.2 Selective data exfiltration 


An infection program can exfiltrate private financial data 
or corporate secrets. To minimize the probability of detec- 
tion, the program rate limits its exfiltrated data. The pro- 
gram searches and prioritizes data inside a cloaked com- 
putation, perhaps using a set of regular expressions. 

Cloaked computation can obscure valuable clues about 
the origin and motivation of the infection authors. The 
regular expressions might target information about a par- 
ticular competitor or project. If white hats can sample the 
exfiltrated data, this would also provide clues; however, it 
would give less direct evidence than a set of search terms, 
and output could be encrypted. 

Stuxnet and Aurora are recent high-profile attacks that
exfiltrate data [38]. Stuxnet seeks out specific industrial 
systems and sends information about the infected OS and 
domain information to one of two command servers [26]. 
A program without cloaked computation could use cryp- 
tographic techniques [59, 18, 28] to keep search criteria 
secret while being observed in memory, but their perfor- 
mance currently makes them impractical. 


2.2.3 Distributed denial-of-service timebomb 


A common malware objective is to attack a target at a cer- 
tain point in time. Keeping the time and target secret until 
the attack prevents countermeasures to reduce the attack’s 
impact. A cloaked computation can securely check the 
day (as above), and only make the target known on the 
launch day. 

Malware analysis has often been important for stop- 
ping distributed denial-of-service (DDoS) attacks. One 
prominent example is MyDoom A. MyDoom A was first 
identified on January 26, 2004 [2]. The worm caused infected
computers to perform a DDoS on www.sco.com
on February 1, 2004, less than a week after the virus was 
first classified. However, the worm was an easy target for 
analysts because its target was in the binary obscured only 
by ROT-13 [1]. Since the target was identified prior to 
when the attack was scheduled, SCO was able to remove 
its domain name from DNS before a DDoS occurred [57]. 

The Storm worm’s targeting of www.microsoft.com [46],
Blaster’s targeting of windowsupdate.com [34], and Code
Red’s targeting of www.whitehouse.gov [22] are other
prominent examples of DDoS timebombs whose effects
were lessened by learning the target in advance of the
attack. If timebomb logic is contained in cloaked code,
then it increases the difficulty of detecting the time and
target of an upcoming attack. Since the target is stored
only in encrypted form locally on infected machines, the
infected machines do not have to communicate over the
network to receive the target at the time of the attack.

Not every machine participating in a DDoS coordinated 
by cloaked computation must have a TPM. A one-million 
machine botnet could be coordinated by one-thousand 
machines with TPMs (to pick arbitrary numbers). The 
TPM-containing machines would repeatedly execute a 
cloaked computation, as above, to determine when to be- 
gin an attack. These machines would send the target to 
the rest when they detect it is time to begin the DDoS. In 
the example, all million machines must receive the DDoS 
target, but the topology of communication is specialized 
to the DDoS task and therefore is more difficult to filter 
and less amenable to traffic analysis than a generic peer- 
to-peer system. 


3 TPM background


This section describes the TPM hardware and support 
software in sufficient detail to understand how it can be 
used to make malware more difficult to analyze. 


3.1 TPM hardware 


TPMs are usually found in x86 PCs as small integrated 
circuits on motherboards that connect to the low pin 
count (LPC) bus and ultimately the southbridge of the PC 
chipset. Each TPM contains an RSA (public-key) cryp- 
tography unit and platform configuration registers (PCRs) 
that maintain cryptographic hashes (called measurements 
by the TCG) of code and data that has run on the platform. 

The goal of the TPM is to provide security-critical 
functions like secure storage and attestation of platform 
state and identity. Each TPM is shipped with a public en- 
cryption key pair, called the Endorsement Key (EK), that 
is accompanied by a certificate from the manufacturer. 
This key is used for critical TPM management tasks, like 
“taking ownership” of the TPM, which is a form of ini- 
tialization. During initialization the TPM creates a secret, 
tpmProof, that is used to protect keypairs it creates. 

The TPM 1.2 specification requires PC TPMs to have
at least 24 PCRs. PCRs 0-7 measure core system components
like BIOS code, PCRs 8-15 are OS defined, and
PCRs 16-23 are used by the TPM’s late launch mechanism,
where sensitive software runs in complete hardware
isolation (no other program, including the OS, may run
concurrently unless specifically allowed by the software).
PCRs cannot be set directly; they can only be extended
with new values. Extending sets a PCR so that it depends on
its previous value and the extending value in a way that is
not easily reversible. PCR state can establish what software
has been run on the machine since boot, including
the BIOS, hypervisor and operating system.


3.2 Managing and protecting TPM storage 


The TPM was designed with very little persistent storage
to reduce cost. The PC TPM specification only mandates
1,280 bytes of non-volatile RAM (NVRAM), so most data
that the TPM uses must be stored elsewhere, like in main
memory or on disk. When we refer to an object as stored
in the TPM, we mean an object stored externally to the
TPM that is encrypted with a key managed by the TPM.
By contrast, when we refer to data as stored internal to
the TPM, we mean data held in locations physically inside
the TPM.


Concatenation of A and B                    A || B
Public/private keypair for asymmetric
encryption named name                       (PK_name, SK_name)
Encryption of data with a public key        Enc(PK, data)

Table 1: Notation for TPM data and computations.

AuthData controls TPM capabilities, which are the 
ability to read, write, and use objects stored in the TPM 
and execute TPM commands. AuthData is a 160-bit se- 
cret, and knowledge of the AuthData for a particular capa- 
bility is demonstrated by using it as a key for calculating a 
hash-based message authentication code (HMAC) of the 
input arguments to the TPM command.²
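
As a rough illustration of this proof of knowledge, the sketch below computes an HMAC-SHA1 tag over a digest of the command arguments, keyed by a 160-bit AuthData value. The function names, the argument encoding, and the single nonce are assumptions made for illustration; this is not the TPM 1.2 wire format.

```python
import hashlib
import hmac


def authorize_command(auth_data: bytes, command_args: bytes, nonce: bytes) -> bytes:
    """Return an HMAC-SHA1 tag showing knowledge of auth_data (illustrative only)."""
    if len(auth_data) != 20:
        raise ValueError("AuthData is a 160-bit (20-byte) secret")
    # Hash the input arguments, then key the HMAC with the AuthData secret.
    args_digest = hashlib.sha1(command_args).digest()
    return hmac.new(auth_data, args_digest + nonce, hashlib.sha1).digest()


def tag_is_valid(auth_data: bytes, command_args: bytes, nonce: bytes, tag: bytes) -> bool:
    """Verifier side: recompute the tag and compare in constant time."""
    expected = authorize_command(auth_data, command_args, nonce)
    return hmac.compare_digest(expected, tag)
```

Because only the tag travels with the command, the party holding the AuthData can compute it on a different machine from the one containing the TPM.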

Public signature and encryption key pairs created by a 
TPM are stored as key blobs only usable with a particular 
TPM. The contents of a key blob are shown in Figure 2. A 
hash of the public portion of a key blob is stored in the pri- 
vate portion, along with tpmProof (mentioned above); 
tpmProof is an AuthData value randomly generated by 
the TPM and stored internally to the TPM when someone 
takes ownership. It protects the key blob from forgery by 
adversaries and even the TPM manufacturer.³

In addition, a TPM user can use the PCRs to restrict use 
of TPM-generated keypairs to particular pieces of soft- 
ware that are identified via a hash of their code and initial 
data. For example, the TPM can configure a key blob so 
that it can only be used when the PCRs have certain values 
(and therefore only when certain software is running).⁴


² Since AuthData is used as an HMAC key, it does not need to be
present on the same machine as the TPM for it to be used. For example,
a remote administrator might hold certain AuthData and use this
to HMAC input arguments and then send these across a network to the
machine containing the TPM. However, AuthData does need to be in
memory (and encrypted) when the secret is first established for a TPM
capability as part of a TPM initialization protocol. We investigate
further the implications of this nuance in our discussions of defenses in
Section 6.

³ Migratable keys are handled somewhat differently, but they are
beyond the scope of this paper.

⁴ Restricting a TPM-generated key to use with certain PCR values is
not the same as the TPM_Seal command found in related literature. The
two are similar, but the former places restrictions on a key’s use, while
the latter places restrictions on the decryption of a piece of data (which
could be a key blob).
Figure 1: The overall flow of the attack is 1) Infecting a system with local malware capable of kernel-level exploitation to coordinate
the attack; 2) Establishing a legitimate TPM-generated key usable only by the Infection Payload Loader in late launch via a multistep
protocol with a Malware Distribution Platform; 3) Delivering a payload that can be decrypted using the TPM-generated key; 4) Using
a late launch environment to decrypt the payload with the TPM-generated key, and running it with inputs passed into memory by
local malware; 5) Retrieving output from payload, potentially repeating step 4 with new inputs. Boxes with “TPM” indicate parts of
the protocol that use the TPM.


Blob((PK, SK)_ex) = PubBlob((PK, SK)_ex) || Enc(PK_parent, PrivBlob((PK, SK)_ex))
PubBlob((PK, SK)_ex) = PK_ex || PCR values
PrivBlob((PK, SK)_ex) = SK_ex || H(PubBlob((PK, SK)_ex)) || tpmProof


Figure 2: Contents of TPM key blob for an example public/private key pair named ex that is stored in the key hierarchy under a key
named parent. For our purposes the parent key of most key blobs is the SRK. (Note that the PCR values themselves are not really
stored in the key blob. Rather the blob contains a bitmask of the PCRs whose values must be verified and a digest of the PCR values.)
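
The blob layout in Figure 2 can be sketched as a small data structure with the two consistency checks it enables. This is a minimal sketch under stated assumptions: the class and function names are hypothetical, encryption of the private portion under the parent key is omitted, and SHA-1 stands in for H. It only shows why changing the public portion (for example, the PCR constraints) or guessing tpmProof yields a blob that would be rejected.

```python
import hashlib
import hmac
from dataclasses import dataclass


def H(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()


@dataclass
class PrivBlob:
    secret_key: bytes   # SK_ex
    pub_digest: bytes   # H(PubBlob)
    tpm_proof: bytes    # secret known only inside the TPM


def make_priv_blob(secret_key: bytes, pub_blob: bytes, tpm_proof: bytes) -> PrivBlob:
    """Build the private portion the way Figure 2 describes it."""
    return PrivBlob(secret_key, H(pub_blob), tpm_proof)


def blob_is_consistent(pub_blob: bytes, priv: PrivBlob, tpm_proof: bytes) -> bool:
    """Accept only if the public digest and tpmProof both match."""
    return (hmac.compare_digest(priv.pub_digest, H(pub_blob))
            and hmac.compare_digest(priv.tpm_proof, tpm_proof))
```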


TPM key storage is a key hierarchy: a single-rooted 
tree whose root is the Storage Root Key (SRK), and is 
created upon the take ownership operation described be- 
low. The private part of the SRK is stored internal to 
the TPM and never present in main memory, even in en- 
crypted form. Since the public part of the SRK encrypts 
the private part of descendant keys (and so on), all keys in 
the hierarchy are described as “stored in the TPM,” even 
though all of them, except the SRK, are stored in main 
memory. Using the private part of any key in the hierar- 
chy requires using the TPM to access the private SRK to 
decrypt private keys while descending the hierarchy. 

It is impossible to use private keys for any of the key- 
pairs stored in the TPM apart from using TPM capabil- 
ities: obtaining the private key for one key would entail 
decrypting the private portion of a key blob, which in turn 
requires the private key of the parent, and so on, up to the 
SRK, which is special in that its private key is never stored 
externally to the TPM (even in encrypted form). A TPM 
key hierarchy is illustrated in Figure 3. 


3.3 Initializing the TPM 


To begin using a TPM, the user (or administrator) must 
first take ownership of it. Taking ownership of the TPM 
establishes three important AuthData values: the owner 
AuthData value, which is needed to set TPM policy, the
SRK AuthData value, which is needed to use the SRK,
and tpmProof. tpmProof is generated internal to the
TPM and stored in NVRAM. It is never present in unencrypted
form outside the TPM.

While it is easy for a professional administrator to 
take ownership of a TPM securely, taking ownership of 
a TPM is a security critical operation that is exposed in 
a very unfriendly way to average users. For example, 
Microsoft’s BitLocker full-disk encryption software uses 
the TPM. When a user initializes BitLocker, it reboots the 
machine into a BIOS-level prompt where the user is pre- 
sented cryptic messages about TPM initialization. Bit- 
Locker performs the initial ownership of the TPM, and it 
acquires privilege to do so with TPM mechanisms for as- 
serting physical presence at the platform via the BIOS. An 
inexperienced user could probably be convinced to agree 
to allow assertion of physical presence by malware similar 
to how rogue programs convince users to install malicious 
software and input their credit card numbers [44]. The 
function of the TPM is complicated and flexible, making 
a simple explanation of it for an average user a real chal- 
lenge. 

Furthermore, malware could also gain use of physical
presence controls in BIOS by attacks that modify
the BIOS itself [48]. Recent work has even demonstrated
attacks against BIOS update mechanisms that require
signed updates [56].


[Figure 3 diagram: (a) the SRK is the parent of the AIK and the
binding key; (b) the children are stored as
PK_AIK || Enc(PK_SRK, SK_AIK) and PK_bind || Enc(PK_SRK, SK_bind).]

Figure 3: The part of the TPM key hierarchy relevant to our
attack. The TPM box illustrates keying material stored internal
to the TPM, which is only the endorsement key (EK) and
storage root key (SRK). Part (a) shows the conceptual key
hierarchy, while part (b) shows how the secret keys of children
are encrypted by the public keys of their parents so keys can be
safely stored in memory. More detail on key formats is found in
Figure 2.


3.4 Platform identity 


TPMs provide software attestation, a proof of what soft- 
ware is running on a platform when the TPM is invoked. 
The proof is given by a certificate for the current PCR 
values, which contain hashes of the initial state of all soft- 
ware run on the machine. This certificate proves to an- 
other party that a TPM-including platform is running par- 
ticular software. The receiver must be able to verify that 
the certificate comes from a legitimate TPM, or the quoted 
measurements or other attestations are meaningless. 

A user desiring privacy cannot directly use her plat- 
form’s EK for attestation. (EKs are linked to specific plat- 
forms, and additionally multiple EK uses can be corre- 
lated.) Instead, she can generate attestation identity keys 
(AIKs) that serve as proxies for the EK. An AIK can sign 
PCR contents to attest to platform state. However, some- 
thing must associate the AIK with the EK. 

A trusted privacy certificate authority (Privacy CA) 
provides certificates to third parties that an AIK corre- 
sponds to an EK of a legitimate TPM. While prototype 
Privacy CA code exists [27], Privacy CAs appear to be 
unused in practice. In our attack, the malware distributor 
acts as a Privacy CA and only trusts AIKs that it certifies. 

We emphasize that our proposed attack does not require 
or benefit from the anonymity guarantees provided by a 
Privacy CA. However, the TPM does not permit a user to 
directly sign an arbitrary TPM-generated public key with 
the EK, so our attack must use an intermediate AIK. 


3.5 Using the TPM 


Typical uses of the TPM are to manipulate the key hier- 
archy, to obtain signed certificates of PCR contents or of 
authenticity of TPM data, and to modify PCRs to describe
platform state as it changes. Keys are created in the key 
hierarchy by “loading” a parent key and commanding the 
TPM to generate a key below that parent, resulting in a 
new key blob. Loading a key entails passing a key blob to 
the TPM to obtain a key handle, which is an integer index 
into the currently loaded keys. Only loaded keys can be 
used for further TPM commands. Loading a key requires 
loading all keys above it in the hierarchy, so loading any 
key in the key hierarchy requires loading the SRK. 
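
The loading rule can be illustrated with a small sketch that walks from a key up to the root and returns the order in which blobs would have to be loaded. The dictionary representation and the names "storage" and "child" are hypothetical; only the SRK, AIK, and binding key appear in the attack described here.

```python
from typing import Dict, List


def load_order(key: str, parent_of: Dict[str, str]) -> List[str]:
    """Return the keys to load, root first, ending with the requested key."""
    chain = [key]
    # Walk upward until we reach a key with no parent (the root).
    while chain[-1] in parent_of:
        chain.append(parent_of[chain[-1]])
    chain.reverse()
    return chain


# Hypothetical hierarchy: AIK and bind sit directly under the SRK,
# while "child" sits under an intermediate "storage" key.
parent_of = {"AIK": "SRK", "bind": "SRK", "storage": "SRK", "child": "storage"}
```

Every chain returned by `load_order` begins with the SRK, which is why every use of a TPM key also requires the SRK AuthData.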

The TPM can produce signed certificates of key authen- 
ticity. To do so, a user specifies a certifying key, and the 
TPM produces a hash of the public key for the key to be 
certified, along with a hash of a bitmask describing the se- 
lected PCRs and those PCR values, and signs both hashes 
with the selected key. 

PCRs can be modified by the TPM as platform state 
changes. They cannot be set directly, and are instead modified
by extension. A PCR with value PCR extended by
a 160-bit value val is set to value Extend(PCR, val) =
H(PCR || val). Late launch extends the PCRs with the
hash of the state of the program run in the late launch en- 
vironment. Thus the TPM can restrict access to keys to a 
particular program. Our malware protocol uses this abil- 
ity to prevent analyst use of a payload decryption key. 
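
The extend operation and the resulting key restriction can be sketched in a few lines. This is an illustrative sketch, assuming SHA-1 for H (TPM 1.2 PCRs hold 160-bit values) and a PCR that starts at all zeros before the measurement; the program images and the `key_usable` helper are hypothetical.

```python
import hashlib

PCR_SIZE = 20  # bytes; a 160-bit digest


def H(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()


def extend(pcr: bytes, val: bytes) -> bytes:
    """Extend(PCR, val) = H(PCR || val)."""
    if len(pcr) != PCR_SIZE or len(val) != PCR_SIZE:
        raise ValueError("PCR and extending value must both be 160 bits")
    return H(pcr + val)


def key_usable(current_pcr: bytes, required_pcr: bytes) -> bool:
    """A PCR-constrained key is usable only when the PCR holds the required value."""
    return current_pcr == required_pcr


ZERO_PCR = bytes(PCR_SIZE)
expected_program = b"hypothetical late launch program image"
other_program = b"hypothetical analyst program image"

# The value the key is locked to: the zeroed PCR extended with the program hash.
required = extend(ZERO_PCR, H(expected_program))
```

Launching any other program leaves a different value in the PCR, so `key_usable` returns False, and because extension is a one-way hash chain, no sequence of further extensions is expected to bring the PCR back to `required`.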


3.6 TPM functionality evolving and best practices unknown


Despite the widespread availability of trusted computing 
technology as embodied by the TPM, its implications are 
not well understood. The specification for the TPM and 
supporting software is complicated; version 1.2 of the 
TPM specification for the PC/BIOS platform with accom- 
panying TCG Software Stack is over 1,500 pages [52]. 
Additionally, there are few guidelines for proper use of 
its extensive feature set. It is quite believable that such 
a complicated mechanism has unintended consequences 
that undermine its security goals. In this paper, we describe
one such consequence: the TPM can be used as a
means to thwart analysis of malware.


Key hierarchy. The lack of guidance on the usage of
TPM capabilities makes it difficult to determine what in- 
formation an attacker might reasonably acquire. For ex- 
ample, the key hierarchy has a single root. Therefore, dif- 
ferent users must share at least one key, and every use of 
a TPM key requires loading the SRK. Loading the SRK 
requires SRK AuthData, and thus the SRK AuthData is 
likely well-known, making it possible for users to imper- 
sonate the TPM, as others have previously indicated [21]. 


EK certificates. As another example of capabilities in
flux, EK certificates critical to identifying TPMs as legitimate
are not always present, and it is not always clear how
to verify those that are. TPM manufacturers are moving
toward certifying TPMs as legitimate by including certificates
for EKs in TPM NVRAM. Infineon gives the most
detail on their EK certification policy, in which the cer- 
tificate chain extends back to a new VeriSign TPM root 
Certificate Authority [11]. ST Microelectronics supplies 
TPMs used in many workstations from Dell. They state 
that their TPMs from 2010 onward contain certificates [9]. 
While no certificates were present on our older machines, 
we did find certificates for our newer Dell machines and 
manually verified the legitimacy of the EK certificate for 
one of our TPMs (which we describe further in Section 5). 


Protecting AuthData. Many uses of the TPM allow AuthData
to be snooped if not used carefully. For example,
standard use of TPM tools with TrouSerS prompts the
user to enter passwords at the keyboard to use TPM
capabilities. These passwords can be captured by a keylogger
if the system is compromised. Thus, although TPM
commands may not require AuthData itself to appear on the
platform, entering this data into the system for use can be insecure.


4 Malware using cloaked computations 


We now describe an architecture and protocol for launch- 
ing a TPM-cloaked attack. 

Our protocol runs between an Infection Program, 
which is malware on the attacked host, and a Malware 
Distribution Platform, which is software executed on 
hardware that is remote to the attacked host. The goal 
of the protocol is for the Infection Program to generate a 
key. The Infection Program attests to the Malware Distri- 
bution Platform that TPM-based protection ensures only it 
can access data encrypted with the key. The Malware Dis- 
tribution Platform verifies the attestation, and then sends 
an encrypted program to the Infection Program. The In- 
fection Program decrypts and executes this payload. This 
protocol enables long-lived and pernicious malware, for 
example, turning a computer into a botnet member. The 
Infection Program can suspend the OS (and all other soft- 
ware) through use of processor late launch capabilities to 
ensure unobservability when necessary, like when the ma- 
licious payload is decrypted and executing. 


4.1 Late launch for secure execution 


The protocol uses late launch to suspend the OS to allow 
decryption and execution of the malicious payload with- 
out observation by an analyst. Late launch creates an exe- 
cution environment where it is possible to keep code and 
data secret from the OS. 

Late launch transfers control to a designated block of
user-supplied code in memory and leaves a hash of that
code in TPM PCRs. Specifically, with Intel’s Trusted Execution
Technology, a user configures data structures to
describe the Measured Launch Environment (MLE), the
program to be run (which resides completely in memory).
She then uses the GETSEC[SENTER] instruction
to transfer control to chipset-specific code, signed by Intel,
called SINIT that performs pre-MLE setup such as
ensuring correctness of MLE parameters. The exact functionality
of SINIT is not known, as its source code is not
public. SINIT then passes control to the MLE. When the
MLE runs, no software may run on any other processor
and hardware interrupts and DMA transfers are disabled.
To exit, the MLE uses the GETSEC[SEXIT] instruction.


Pack(data, extra, PK):

1. Generate symmetric key K
2. Asymmetric encrypt K || extra to form Enc(PK, K || extra)
3. Symmetric encrypt data to form EncSym(K, data)
4. Output EncSym(K, data) || Enc(PK, K || extra)

Unpack(EncSym(K, data) || Enc(PK, K || extra), SK):

1. Asymmetric decrypt Enc(PK, K || extra) with SK to
   obtain K and extra
2. Symmetric decrypt EncSym(K, data) with K to obtain
   data
3. Output data, extra


Figure 4: Subroutines used in main protocol. extra is needed
for TPM_ActivateIdentity, and can be empty (∅). Running
Unpack on the TPM uses TPM_Unbind.


4.2 Malware distribution protocol 


The Infection Program first establishes a proof that it is 
using a legitimate TPM. It uses the TPM to generate two 
keys. One is a “binding key” that the Malware Distribu- 
tion Platform will use to encrypt the malicious payload. 
The other is an AIK that the TPM will use in the Privacy 
CA protocol, where the Malware Distribution Platform 
plays the role of the Privacy CA. The Malware Distribu- 
tion Platform will accept its own certification that the AIK 
is legitimate in a later phase. As stated before, the Privacy 
CA protocol enables indirect use of the private EK only 
kept by the TPM. A valid private EK cannot be produced 
by an analyst; it is generated by a TPM manufacturer and 
only accessible to the TPM hardware. This part of the 
Infection Program is named “Infection Keygen”. 

Our description of the protocol steps will elide lower-level
TPM authorization commands like TPM_OIAP and
TPM_OSAP that are used to demonstrate knowledge of
authorization data and prevent replay attacks on TPM commands.

We use subroutines Pack(data, extra, PK) and
Unpack(data, SK), which use asymmetric keys with
intermediate symmetric keys. Symmetric keys increase the
efficiency of encryption, are required for certain TPM 
commands, and circumvent the limits (due to packing 
mechanisms) on the length of asymmetrically encrypted 
messages. These subroutines are shown in Figure 4 and 
the main protocol is in Figure 5. 
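
The hybrid structure of Pack and Unpack can be sketched with the Python standard library alone. The cryptography below is deliberately a toy and is an assumption of this sketch, not the paper's or the TPM's implementation: the asymmetric step is textbook RSA without padding over a modulus built from two known Mersenne primes, and the symmetric step is an XOR with a SHA-256 counter keystream. Neither is secure; they only show how a fresh symmetric key K carries the bulk data while K || extra goes under the public key. On a real platform the asymmetric half of Unpack would be performed by the TPM.

```python
import hashlib
import secrets

# Toy RSA key (insecure stand-in): two known Mersenne primes, no padding.
P = 2**521 - 1
Q = 2**607 - 1
N = P * Q
E = 65537
D = pow(E, -1, (P - 1) * (Q - 1))  # modular inverse; needs Python 3.8+

KEY_LEN = 16  # bytes of symmetric key K (an assumption of this sketch)


def _keystream_xor(key: bytes, data: bytes) -> bytes:
    """Toy symmetric cipher: XOR data with a SHA-256 counter keystream."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        pad = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        out.extend(b ^ p for b, p in zip(data[offset:offset + 32], pad))
    return bytes(out)


def pack(data: bytes, extra: bytes, pk=(N, E)):
    """Return (EncSym(K, data), Enc(PK, K || extra))."""
    n, e = pk
    k = secrets.token_bytes(KEY_LEN)
    plain = b"\x01" + k + extra  # leading 0x01 keeps the byte length recoverable
    m = int.from_bytes(plain, "big")
    if m >= n:
        raise ValueError("K || extra is too long for this modulus")
    return _keystream_xor(k, data), pow(m, e, n)


def unpack(packed, sk=(N, D)):
    """Return (data, extra) from the output of pack."""
    enc_sym, enc_asym = packed
    n, d = sk
    m = pow(enc_asym, d, n)
    plain = m.to_bytes((m.bit_length() + 7) // 8, "big")
    k, extra = plain[1:1 + KEY_LEN], plain[1 + KEY_LEN:]
    return _keystream_xor(k, enc_sym), extra
```

The asymmetric ciphertext stays a fixed small size regardless of how long `data` is, which is the length limit the symmetric key works around.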


4.3 Analyzing the resilience of the protocol 


A malware analyst can attempt to subvert the protocol by 
tampering with data or introducing keys under her control. 
We now analyze the possibilities for subversion. 
Infection Keygen: Generate binding key that Malware Distribution Platform will eventually use to encrypt malicious payload, AIK
that certifies it, and request for Malware Distribution Platform to test AIK legitimacy

1. Create binding keypair (PK, SK)_bind under the SRK with
   TPM_CreateWrapKey(SRK, PCR18 = Extend(0^160, H(Infection Payload Loader))) (requires SRK AuthData), store in
   memory
2. Create identity key (PK, SK)_AIK under SRK in memory as Blob((PK, SK)_AIK) with TPM_MakeIdentity (requires
   owner AuthData)
3. Retrieve EK certificate C_EK = PK_EK || Sign(SK_manufacturer, H(PK_EK)), which certifies that the TPM with that EK is
   legitimate (requires owner AuthData to obtain from NVRAM with TPM_NV_ReadValue from EK index or needs to be on
   disk already)
4. Send M_req = PubBlob((PK, SK)_AIK) || C_EK to Malware Distribution Platform as a request to link AIK and EK


Malware Distribution Platform Certificate Handler: Give Infected Platform credential only decryptable by legitimate TPM

1. Receive M_req
2. Verify Sign(SK_manufacturer, H(PK_EK)) with manufacturer CA public key
3. Generate hash H_aik.cert = H(PubBlob((PK, SK)_AIK))
4. Sign H_aik.cert with SK_malware, private key known only to the Malware Distribution Platform whose corresponding public
   key is known to all, to form Sign(SK_malware, H_aik.cert). Sign(SK_malware, H_aik.cert) is a credential of AIK legitimacy.
5. Run Pack(Sign(SK_malware, H_aik.cert), H_aik.cert, PK_EK) to form
   M_req.resp = Enc(PK_EK, K2 || H_aik.cert) || EncSym(K2, Sign(SK_malware, H_aik.cert)). M_req.resp contains the credential
   in a way such that it can only be extracted by a TPM with private EK SK_EK when the credential was created for an AIK
   stored in that TPM.
6. Send M_req.resp to Infected Platform


Infection Proof: Decrypt credential, assemble certificate chain from manufacturer certified EK to binding key (including credential)

1. Receive M_req.resp
2. Load AIK (PK, SK)_AIK and binding key (PK, SK)_bind with TPM_LoadKey2
3. Use TPM_ActivateIdentity, which decrypts Enc(PK_EK, K2 || H_aik.cert) and retrieves K2 after comparing H_aik.cert
   to that calculated from loaded AIK located in internal TPM RAM. If comparison fails, abort. (requires owner AuthData)
4. Symmetric decrypt EncSym(K2, Sign(SK_malware, H_aik.cert)) to retrieve Sign(SK_malware, H_aik.cert)
5. Certify (PK, SK)_bind with TPM_CertifyKey to produce
   Sign(SK_AIK, H(PCRs(PubBlob((PK, SK)_bind))) || H(PK_bind)) = Sign(SK_AIK, H_bind.cert)
6. Send M_proof = Sign(SK_malware, H_aik.cert) || PubBlob((PK, SK)_AIK) || Sign(SK_AIK, H_bind.cert) ||
   PubBlob((PK, SK)_bind), all the evidence needed to verify TPM legitimacy, to Malware Distribution Platform


Malware Distribution Platform Payload Delivery: Verify certificate chain, respond with encrypted malicious payload if successful

1. Receive M_proof
2. Verify signatures of H_aik.cert by SK_malware using PK_malware, of H_bind.cert using PK_AIK. Check that H_bind.cert
   corresponds to the binding key by comparing hash of public key, PCRs to PubBlob((PK, SK)_bind). Use
   PubBlob((PK, SK)_bind) to determine if binding key has a proper constraint for PCR18. Abort if verification fails or
   binding key improperly locked.
3. Hash and sign the payload with SK_malware to form Sign(SK_malware, H(payload)) (only needs to be done once per
   payload)
4. Run Pack(payload || Sign(SK_malware, H(payload)), ∅, PK_bind) to form
   M_payload = EncSym(K3, payload || Sign(SK_malware, H(payload))) || Enc(PK_bind, K3)
5. Send M_payload to Infected Platform


Infection Payload Execute: Use late launch to set PCRs to allow use of binding key for decryption and to prevent OS from
accessing this key during use

1. Receive M_payload
2. Late launch with MLE = Infection Payload Loader


Infection Hidden Execute: Infection Payload Loader decrypts and executes the payload in the late launch environment.

1. Load (PK, SK)_bind with TPM_LoadKey2
2. Run Unpack(M_payload, SK_bind). This operation can succeed (and only in this program) because in Infection Hidden
   Execute, PCR18 = Extend(0^160, H(Infection Payload Loader)). Obtain payload || Sign(SK_malware, H(payload)).
3. Verify signature Sign(SK_malware, H(payload)) with PK_malware. Abort if verification fails.
4. Execute payload
5. If return to OS execution is desired, scrub payload from memory and extend random value into PCR18, then exit late launch


Figure 5: The cloaked malware protocol.
key blob = TPM_CreateWrapKey(parent key, PCR constraints)
    Generate new key with PCR constraints under the parent
    key in hierarchy. The resultant key may be used for
    encryption and decryption, but not signing.

key handle = TPM_LoadKey2(key blob)
    Load a key for further use.

key blob = TPM_MakeIdentity()
    Generate an identity key under SRK that may be used
    for signing, but not encryption and decryption.

sym_key = TPM_ActivateIdentity(identity key handle, CA response)
    Verify that asymmetric CA response part corresponds
    to identity key. If agreement, decrypt response and
    retrieve enclosed symmetric key.

(certificate, signature) = TPM_CertifyKey(certifying key handle, key handle)
    Produce certificate of key contents. Sign certificate with
    certifying key.

value = TPM_NV_ReadValue(index)
    Retrieve data from TPM NVRAM.


Table 2: Additional functions in the main protocol. Keywords that are in fixed-width font that begin with TPM_ are TPM commands
defined in the TPM 1.2 specification.


The analyst’s goal is to cause the malicious payload to 
be encrypted with a key under her control, or to observe a 
decrypted payload. She could try to create a binding key 
blob during Infection Proof, and certify it with a legiti- 
mate TPM. However, the analyst does not know the value 
of tpmProof for any TPM because it is randomly generated
within the TPM and is never present (even in encrypted
form) outside the TPM. Without tpmProof, the
analyst cannot generate a key blob that the TPM will cer- 
tify, even under a legitimate AIK. This argument relies on 
the fact that the encryption system is non-malleable [25] 
and chosen ciphertext secure. Otherwise, an attacker 
might be able to take a legitimately created ciphertext with 
tpmProof in it and modify it to an illegitimate ciphertext 
with tpmProof in it, without knowing tpmProof. 


The analyst could attempt to modify PCR constraints 
on the binding key by tampering with the public part
of the key. However, the TPM will not load the key in the 
modified blob because a digest of the public portion of the 
blob will not match the hash stored in the private portion. 
Thus, storing the binding key in the public part of the blob 
where it is accessible to the analyst does not compromise 
the security of the protocol. If the binding key is a legiti- 
mate TPM key with PCR constraints that do not lock it to 
being observed only during Infection Hidden Execute, 
the Malware Distribution Platform will detect it during 
Malware Distribution Platform Payload Delivery, and 
the platform will not encrypt the payload with that key. 
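
The integrity check described above can be illustrated with a toy model. This is only a sketch under stated assumptions: `ToyKeyBlob`, `make_blob`, and `tpm_would_load` are hypothetical names, the blob layout is invented for illustration and is not the TPM 1.2 key blob format, and a real TPM keeps the digest inside the encrypted private portion of the blob.

```python
import hashlib
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ToyKeyBlob:
    """Toy stand-in for a TPM key blob (hypothetical layout, not the TPM 1.2 format)."""
    public_part: bytes     # includes the PCR constraints; readable by anyone
    private_digest: bytes  # digest of public_part; encrypted inside a real blob


def make_blob(public_part: bytes) -> ToyKeyBlob:
    # In a real TPM this digest sits in the encrypted private portion,
    # so only the TPM can produce or change it.
    return ToyKeyBlob(public_part, hashlib.sha1(public_part).digest())


def tpm_would_load(blob: ToyKeyBlob) -> bool:
    # Loading succeeds only if the public portion still matches the stored digest.
    return hashlib.sha1(blob.public_part).digest() == blob.private_digest


original = make_blob(b"pubkey|PCR18=loader-hash")
# An analyst edits the PCR constraints but cannot update the protected digest.
tampered = replace(original, public_part=b"pubkey|PCR18=analyst-hash")
print(tpm_would_load(original), tpm_would_load(tampered))
```

In this model the untouched blob loads and the edited one is rejected, which is why exposing the public portion does not help the analyst.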


The analyst could attempt to forge keys at other points 
in the hierarchy: she could attempt to certify a binding 
key she creates with an AIK that she creates. The Mal- 
ware Distribution Platform only obtains the public por- 
tions of these key blobs, and so cannot check directly in 
Malware Distribution Certificate Handler that the AIK 
is legitimate. The Malware Distribution Platform could 
not verify the legitimacy of key blobs even with their pri- 
vate portions as the Platform can neither decrypt the pri- 
vate portions, nor know the value of tpmProof for the 
Infected Platform. However, it encrypts with the EK a 
credential that is a signed hash of the AIK it is sent by
Infection Keygen running on an infected platform. The EK
is proven legitimate by a certificate of authenticity signed
by the TPM manufacturer’s private key and verified by the
Malware Distribution Platform. The private EK is only
stored internal to the TPM, and only usable under controlled
circumstances like TPM_ActivateIdentity;
to our knowledge, there is no way to compel the
TPM to decrypt arbitrary data with the private EK.
TPM_ActivateIdentity will only decrypt a public
EK-encrypted blob of the form (K || H_aikcert), where
H_aikcert is the hash of the public portion of an AIK key
blob where the AIK has been loaded into the TPM (and
thus has not been tampered with). Therefore, K cannot
be recovered for an illegitimate AIK, and the credential
Sign(SK_malware, H_aikcert) cannot be recovered. Without
this credential, the protocol will abort in Malware
Distribution Platform Payload Delivery (step 2). The 
credential cannot be forged as it contains a signature with 
a private key known only by the Malware Distribution 
Platform. 


The analyst could try to execute forged payloads with 
Infection Hidden Execute because the public binding 
key is visible. However, because Infection Hidden Exe- 
cute will only execute payloads signed by a key unknown 
to the analyst, this will not work. No program other than 
Infection Hidden Execute and the programs it executes 
can access the binding key. 


The analyst could try to set the PCR values to those 
specified in (PK, SK)_bind, but run a program other
than Infection Payload Loader. This would allow her
to decrypt the payload (step 2 in Infection Hidden Execute).
The values of PCRs are affected by processor
events and the SINIT code module. The CPU instruction
GETSEC[SENTER] sends an LPC bus signal to initialize
the dynamically resettable TPM PCRs (PCRs 16-23)
to 160 bits of 0s. No other TPM capability can reset these
PCRs to all 0s; a hardware reset sets them to all 1s. So an
analyst can only set PCR 18 to all 0s with a late launch
executable. SINIT extends PCR 18 with a hash of the
MLE. Therefore, to set PCR 18, the analyst must run an 
MLE with the correct hash. Assuming the hash function is 
collision resistant, only the Infection Payload Loader will 
hash to the correct value, so the analyst cannot run an al- 
ternate program that passes the PCR check. The payload 
loader terminates at payload end by extending a random 
value into PCR 18, so the analyst cannot use the key after 
the late launch returns. 
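
The PCR argument above can be made concrete with a small simulation of the TPM 1.2 extend operation, which sets a PCR to SHA-1(old value || measurement). This is a sketch only: the image byte strings are placeholders, and the helper names `extend` and `pcr18_after_late_launch` are hypothetical, not TPM or Flicker API calls.

```python
import hashlib
import os

PCR_SIZE = 20  # SHA-1 digest length in bytes (160 bits)


def extend(pcr: bytes, measurement: bytes) -> bytes:
    """TPM 1.2 extend: new PCR value = SHA-1(old PCR value || measurement)."""
    return hashlib.sha1(pcr + measurement).digest()


def pcr18_after_late_launch(mle_image: bytes) -> bytes:
    # Late launch resets PCR 18 to 160 zero bits, then SINIT extends it with H(MLE).
    reset_value = bytes(PCR_SIZE)
    return extend(reset_value, hashlib.sha1(mle_image).digest())


loader = b"payload loader image (placeholder bytes)"
other = b"analyst program image (placeholder bytes)"

expected = pcr18_after_late_launch(loader)
assert pcr18_after_late_launch(loader) == expected   # same MLE, same PCR value
assert pcr18_after_late_launch(other) != expected    # different MLE fails the check

# On exit the loader extends a random value, so PCR 18 no longer matches the key's constraint.
after_exit = extend(expected, os.urandom(PCR_SIZE))
assert after_exit != expected
```

Because the extend operation chains hashes, there is no input that resets the register to a chosen value; the only route to the expected PCR 18 value is a late launch of the exact measured image.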


4.4 Prevention of malware analysis 


Having described our protocol for cloaked malware ex- 
ecution, we review how it defeats conventional malware 
analysis. While our list of malware analysis techniques 
may not be exhaustive, to our knowledge, TPM cloaking 
can be defeated only by TPM manufacturer intervention, 
or by physical attacks, like direct monitoring of hardware 
events or tampering with the TPM or system buses. Both 
of these are discussed in more detail in Section 6. 

Static analysis. Cloaked computations are encrypted 
and are only decrypted once the TPM has verified that the 
PCRs match those in the key blob. The malware author 
specifies PCR values that match only the Infection Pay- 
load Loader, so no analyst program can decrypt the code 
for a cloaked computation. 

Honeypots. Honeypots are open systems that collect 
and observe malware, possibly using some combination 
of emulation, virtualization and instrumented software. 
Purely software-based honeypots can try to follow our 
protocol without using a legitimate hardware TPM, but 
will fail to convince a malware distributing machine of 
their authenticity. This failure is due to their inability to 
decrypt Enc(PK_EK, K2 || H_aikcert), which is encrypted
with the public EK that is certified by a TPM manufacturer
in C_EK, and the private part of which is not present
outside of a TPM. Thus these honeypots will never re- 
ceive the malicious payload. If a honeypot uses a legit- 
imate hardware TPM, it will obtain a malicious payload. 
However, it can only execute the payload with late launch, 
which prevents software monitoring of the unencrypted 
payload. 

Virtualization. Software-based TPMs, virtualized 
TPMs, and virtual machine monitors communicating with 
hardware TPMs cannot defeat cloaking. Hardware TPMs 
have certificates of authenticity that are verified in our 
malware distribution protocol. A software-based TPM ei- 
ther will not have a certificate, or will have a certificate 
that is distinguishable from a hardware TPM. Either way, 
it will fail to convince a malware distribution platform of 
its authenticity. An analyst cannot use a virtual machine 
to defeat cloaking. 

Hardware TPM manufacturers should not certify 
software-based TPMs as authentic hardware TPMs. 
Software-based TPMs cannot provide the same secu- 
rity guarantees as hardware-based TPMs. The PCRs of 
software-based TPMs might not correspond to platform
state in any way, as they can be modified by sufficiently
privileged software. A software TPM cannot attest to a
particular software environment, because it does not know
the true software environment; it could be executing in a
virtual environment. Any certificate for a software-based
TPM must identify the TPM as software; otherwise, the
chain of trust is broken, defeating remote attestation (a
major purpose of TPMs). No TPM manufacturer cur- 
rently signs software TPM EKs, nor (to our knowledge) 
do any plan to do so. Prior work on virtualizing TPMs 
emphasizes that virtual TPMs and their certificates must 
be distinguishable from hardware TPMs, as the two do 
not provide the same security guarantees [17]. A malware 
distribution platform can avoid software and virtual TPM 
certificates by using a whitelist of known-secure hardware 
TPM certificate distributors compiled into the malware. 

Software, such as a virtual machine monitor, cannot 
communicate with a legitimate hardware TPM to obtain 
and decrypt the malicious payload without running the 
payload in late launch. The only way that the mali- 
cious payload can be decrypted is through use of a private 
key stored in the TPM that can only be used when the 
TPM PCRs are in a certain state. This state can only be 
achieved through late launch, which is a non-virtualizable 
function, and it prevents software monitoring of the unen- 
crypted payload. TPM late-launch is designed to be non- 
virtualizable, so that TPM hardware can provide a com- 
plete and reliable description of platform state. 


4.5 Attack assumptions 


Like any attack, ours has particular assumptions. As dis- 
cussed in Section 2.1, our protocol requires late launch 
instructions, which are privileged, so Infection Hidden 
Execute must run at kernel privilege levels. 

More importantly, our attack requires knowledge of 
SRK and owner AuthData values. There are two main 
possibilities for acquiring this AuthData previously men- 
tioned in Section 3: snooping and overriding with physi- 
cal presence. 

AuthData can be snooped from kernel or application 
(e.g. TrouSerS) memory or from logged keystrokes, 
which are converted into AuthData by a hash. The like- 
lihood of successful AuthData snooping depends on the 
particular AuthData being gathered. The SRK must be 
loaded to load any other key stored in the TPM, so there 
will be regularly occurring chances to snoop the SRK Au- 
thData. Owner AuthData, on the other hand, is required 
for fewer, and generally more powerful, operations. It is 
then liable to be more difficult to acquire. 

One could enter all AuthData remotely to a platform 
that contains a TPM, but we consider it unlikely that this 
is done in practice. TPM arguments could be HMACed 
by a trusted server, but such a server can become a perfor- 
mance or availability bottleneck. Use of a trusted server 
is also problematic for use of laptops that may not always
have network connectivity. For these cases, it may be pos- 
sible to enter AuthData into a separate trusted device that 
then can assist in authorizing TPM commands. However, 
such devices are currently not deployed. It is currently 
more likely that AuthData would be presented through a 
USB key or entered at the keyboard, and in both cases it 
can be snooped. In addition, applications and OS services 
used to provide AuthData to the TPM may not sufficiently 
scrub sensitive data from memory. 
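
As a defensive illustration of the scrubbing concern, the sketch below hashes a secret into 20-byte AuthData and then overwrites the mutable input buffer. It rests on assumptions: the exact encoding that real TPM software applies to a passphrase before hashing is not specified here, the helper names are hypothetical, and Python cannot guarantee that no other copies of the secret exist (the returned digest itself is an immutable object). Production code would more plausibly do this in C with explicit buffer control.

```python
import hashlib


def authdata_from_secret(secret: bytearray) -> bytes:
    """Hash a secret into 20-byte AuthData (SHA-1 output is 160 bits)."""
    return hashlib.sha1(secret).digest()


def scrub(buf: bytearray) -> None:
    # Overwrite in place; immutable bytes/str objects cannot be scrubbed this way.
    for i in range(len(buf)):
        buf[i] = 0


secret = bytearray(b"owner passphrase (placeholder)")
auth = authdata_from_secret(secret)
scrub(secret)  # the plaintext secret should not outlive its use

assert secret == bytearray(len(secret))  # buffer now holds only zero bytes
assert len(auth) == 20
```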

To demonstrate the possibility of acquiring AuthData 
from the OS, we virtualized a Windows 7 instance, and 
used OS-provided control panels to interact with the 
TPM. When AuthData was read from a removable drive, 
it remained in memory for long periods of time on an idle 
system, even after the relevant control panels were closed. 
The entire contents of the file containing the AuthData 
were present in memory for up to 4 hours after the Auth- 
Data was read, and the removable drive ejected from the 
system. The AuthData itself remained in memory for sev- 
eral days, before the system was eventually shut down. 

If malware can use mechanisms for asserting physical 
presence at the platform, it can clear the current TPM 
owner and install a new owner, preventing the need to 
snoop any AuthData. While physical presence mecha- 
nisms should be tightly controlled, their implementation 
is left up to TPM and BIOS manufacturers. Our experi- 
ence setting up BitLocker (see Section 3.3) indicates that 
the process can be confusing, and that it may be possible 
to convince a user to enable malware to obtain the neces- 
sary authorization to use TPM commands. 


4.6 Distributing the malware distribution platform 


As written, the malware distribution platform consists of a
host (or small number of hosts) controlled by the attacker
and trusted with the attacker’s secret key (SK_malware).
This design creates a single point of failure. 

The Malware Distribution Platform computation con- 
sists of arithmetic and cryptographic work (with no OS 
involvement) with an embedded secret. It is a perfect can- 
didate to run as a cloaked computation. An attacker can 
distribute work done on the Malware Distribution Plat- 
form to compromised hosts using cloaked computations. 


5 Implementation and Evaluation 


We implemented a prototype of our attack, which con- 
tains implementations of the establishment of a TPM- 
controlled binding key, the decryption and execution of 
payloads in late launch, and sample attack payloads. In 
this section, we describe each of these pieces in turn. 

The prototype implementation consists of five pro- 
grams for the key establishment protocol (described in 
Table 3), the Infection Payload Loader PAL and ported 
TrouSerS TPM utility code, payload programs, and supporting
code to connect the pieces. The key establishment
programs are about 3,600 lines of C, the Infection
Payload Loader is another 400 lines of C, with another
150 lines of C added to provide TPM commands through
selections of TrouSerS TPM code which themselves required
minor modifications. The payloads were about 50
lines apiece with an extra 75-line supporting DSA routine,
which was necessary for verifying Ubuntu’s repository
manifests. All code size measurements are as measured
by SLOCCount [53].


5.1 Binding key establishment 


We implemented a prototype of the protocol described in 
Figure 5 using the TrouSerS [6] (v0.3.6) implementation 
of the TCG software stack (TSS) to ease development. 

Our implementation follows the protocol, except for
steps 2 to 3 in Infection Keygen, which use the TSS
API call Tspi_CollateIdentityRequest. This
call does not produce M_req (step 4), but instead
produces EncSym(k, PubBlob((PK, SK)_AIK)) and
Enc(PK_malware, k) that must be decrypted in the Malware
Distribution Platform Certificate Handler. While the
protocol specifies network communication, the prototype 
communicates via files on one machine. TrouSerS is not 
necessary for malware cloaking; TPM commands made 
by TrouSerS could be made directly by malware. 


5.1.1 EK certificate verification 


We verified the authenticity of our ST Microelectronics 
TPM endorsement key (EK). However, we had to over- 
come obstacles along the way, and there may be obstacles 
with other TPM manufacturers as well. For example, we 
needed to work around unexpected errors in reading the 
EK certificate from TPM NVRAM. Reads greater than or 
equal to 863 bytes in length return errors, even though the 
reads seem compatible with the TPM specification, and 
the EK certificate is 1129 bytes long. We read the certifi- 
cate with multiple reads, each smaller than 863 bytes. 
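
The multi-read workaround just described can be sketched as a generic chunked read. This is an illustration under assumptions: `read_in_chunks` and `fake_nv_read` are hypothetical helpers, the fake NVRAM contents are arbitrary, and the callback merely mimics the observed failure of reads of 863 bytes or more; it is not the TrouSerS or TPM_NV_ReadValue interface.

```python
from typing import Callable

MAX_READ = 862  # largest single read below the observed 863-byte failure threshold


def read_in_chunks(read: Callable[[int, int], bytes], total_len: int,
                   max_read: int = MAX_READ) -> bytes:
    """Read total_len bytes using reads of at most max_read bytes each."""
    data = bytearray()
    offset = 0
    while offset < total_len:
        size = min(max_read, total_len - offset)
        data += read(offset, size)
        offset += size
    return bytes(data)


# Hypothetical stand-in for the NVRAM area holding the 1129-byte EK certificate.
FAKE_NVRAM = bytes(i % 251 for i in range(1129))


def fake_nv_read(offset: int, size: int) -> bytes:
    # Mimics the observed behavior: reads of 863 bytes or more fail.
    if size >= 863:
        raise IOError("read too large")
    return FAKE_NVRAM[offset:offset + size]


cert = read_in_chunks(fake_nv_read, len(FAKE_NVRAM))
assert cert == FAKE_NVRAM
```

With these numbers the 1129-byte certificate is recovered in two reads (862 bytes, then 267 bytes).
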
The intermediate certificates in the chain linking the 
TPM to a trusted certificate authority were not available 
online, and we obtained them from ST Microelectronics 
directly. However, some manufacturers (e.g. Infineon) 
make the certificates in their chains available online [11]. 
To deploy TPM-based cloaking on a large scale, the veri- 
fication process for a variety of TPMs should be tested. 
For the TPM we tested, the certificate chain was of 
length four including the TPM EK certificate and rooted 
at the GlobalSign Trusted Computing Certificate Author- 
ity. There were two levels of certificates within ST Mi- 
croelectronics: Intermediate EK CA 01 (indicating there 
are likely more intermediate CAs) and a Root EK CA. 


5.2 Late launch environment establishment 


We modified code from the Flicker [40] (v0.2) distribution 
to implement our late launch capabilities. Flicker pro- 
vides a kernel module that allows a small self-contained 
program, known as a Piece of Application Logic or PAL,
to be started in late launch with a desired set of parameters 
as inputs in physical memory. The kernel module accepts 
a PAL and parameters through a sysfs filesystem in- 
terface in Linux, then saves processor context before per- 
forming a late launch, running the PAL in late launch, and 
then restoring the processor context after the PAL com- 
pletes. Output from PALs is available through the filesys- 
tem interface when processor context is restored. 

We implemented the Infection Payload Loader as a 
PAL, which takes the encrypted and signed payload, the 
symmetric key used to encrypt the payload encrypted with 
the binding key, and the binding key blob as parameters. 
We used the PolarSSL [15] embedded cryptographic li- 
brary for all our cryptographic primitives (AES encryp- 
tion, RSA encryption and signing, SHA-1 hashing and 
SHA-1 HMACs). 

We ported code from TrouSerS to handle use of 
TPM capabilities that were not implemented by the 
Flicker TPM library (TPM_OIAP, TPM_LoadKey2,
TPM_Unbind). We replaced the TrouSerS code depen- 
dence on OpenSSL with PolarSSL. We fixed two small 
bugs in Flicker’s TPM driver that seem to be absent from 
the recent 0.5 release due to use of an alternate driver. 


5.3 Payloads


We implemented payloads for the three examples from 
Section 2.2. Here we describe the payloads in detail. 


Domain generation. The domain generation payload
provides key functionality for a secure command and con- 
trol scheme, in which malware generates time-based do- 
main names unpredictable to an analyst. As input, the 
payload takes the contents of a package release manifest 
for the Ubuntu distribution, and its associated signature. 
The payload verifies the signature against a public key 
within itself. If the signature verifies correctly, the pay- 
load extracts the date contained in the manifest. The pay- 
load outputs an HMAC of the date with a secret key con- 
tained in the encrypted payload. 

Assuming an analyst is unable to provide correctly 
signed package manifests for future dates, this payload 
provides a secure random value unpredictable to an ana- 
lyst, but generatable in advance by the payload’s author 
(because the author knows the secret HMAC key). Such 
a random value can be used as a seed in a domain genera- 
tion scheme similar to that of the Conficker worm. 
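
The property relied on here is that of a keyed MAC: without the key, the output for a future date cannot be predicted, while the key holder can compute it ahead of time. The sketch below shows only that property, using HMAC-SHA1 from the Python standard library. The key, the date strings, and the name `daily_seed` are placeholders, and the date is assumed to have already been authenticated by the manifest signature check.

```python
import hashlib
import hmac


def daily_seed(secret_key: bytes, manifest_date: str) -> str:
    """Return HMAC-SHA1(secret_key, date) as hex; the date is assumed already authenticated."""
    return hmac.new(secret_key, manifest_date.encode("ascii"), hashlib.sha1).hexdigest()


key = b"placeholder-secret"  # hypothetical key, for illustration only
seed_a = daily_seed(key, "2011-08-10")
seed_b = daily_seed(key, "2011-08-11")

assert seed_a != seed_b                                    # a new date yields a new value
assert seed_a == daily_seed(key, "2011-08-10")             # the key holder can recompute it
assert seed_a != daily_seed(b"wrong-guess", "2011-08-10")  # a wrong key gives a different value
```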


Data exfiltration. The data exfiltration payload searches
for sensitive data (we looked for credit card numbers), and 
returns results in encrypted form. To avoid analysis by 
correlating input with the presence or absence of output, 
the payload generates some output regardless of whether 
sensitive data is present in the file. 


Timebomb. This payload implements key cloaked functionality
necessary for a timed DDoS attack that keeps the
target and time secret until the attack begins. Like the domain
generation payload, it uses signed package release
manifests to establish an authenticated current timestamp. 
Once the payload has verified the signature on the mani- 
fest, it extracts the date. If the resultant date is later than 
a value encoded in the encrypted payload, it releases the 
time-sensitive information as output. This payload out- 
puts a secret AES key contained in the encrypted payload. 
The key can be used to decode a file providing further in- 
structions, such as the DDoS target, or a list of commands. 


5.4 Evaluation 


We tested our implementation on a Dell Optiplex 780 with 
a quad-core 2.66 GHz Intel Core 2 CPU with 4 GB of
RAM running Linux 2.6.30.5. We used an ST Microelectronics
ST19NP18 TPM, which is TCG v1.2 compliant.
Elapsed wallclock times for protocol phases are indicated 
in Table 4. We used 2048-bit RSA encryption and 128-bit 
AES encryption. The malicious payloads varied in size 
from 2.5 KB for the command and control to 0.5 KB for 
the text search. 

















Costs for infecting a machine

Action                                                    Time (s)
Infected Platform generates binding key                   19.4 ± 11.2
Infected Platform generates AIK and credential request    31.6 ± 17.9
Malware Distribution Platform processes request           0.07 ± 0
Infected Platform certifies key                           5.9 ± 0.012
Infected Platform decrypts credential                     6.0 ± 0.010
Malware Distribution Platform verifies proof              0.04 ± 0
Total                                                     63.1 ± 22.2

Per-payload execution statistics                          Time (s)
MLE setup                                                 1.05 ± 0.01
Time to decrypt payload                                   3.07 ± 0.01
Command and Control                                       0.008 ± 0
DDoS Timebomb                                             0.008 ± 0
Text Search                                               0.004 ± 0
Time system appears frozen                                3.22
Total MLE execution time                                  4.27


Table 4: Performance of different phases. Error bars are standard
deviations of sample sets. A standard deviation of “0” indicates
less than 1 ms. Statistics for the protocol up to late launch
were calculated from 10 protocol cycles run one immediately after
the other, while late launch payload statistics were calculated
from 10 other runs per payload, one immediately after the other.


The main performance bottleneck is TPM operations, 
especially key generation. We verified that the significant 
and variable duration of key generation was directly due 
to underlying TPM operations. The current performance, 
one minute per machine infection, allows rapid propagation
of malware (hosts can be compromised concurrently).


Program: tpm_genkey
    Purpose: Generates the binding key and outputs the key blob to a file.
    Correspondence to protocol: Infection Keygen step 1

Program: aik_gen
    Purpose: Generates an AIK and accompanying certification request. Outputs key blob and request to files.
    Correspondence to protocol: Infection Keygen steps 2-4

Program: tpm_certify
    Purpose: Certifies the binding key under the AIK.
    Correspondence to protocol: Infection Proof step 5

Program: infected
    Purpose: Two modes: proof, which generates a proof of authenticity to convince the Malware Distribution
    Platform to distribute an encrypted payload, and payload, which loads the binding key and decrypts the payload.
    Correspondence to protocol: proof: Infection Proof steps 1-4 and 6; payload: Infection Hidden Execute steps 1-3

Program: platform
    Purpose: Two modes: req, which handles a request from the Infected Platform and returns an encrypted
    credential, and proof, which validates a proof of authenticity from the Infected Platform.
    Correspondence to protocol: req: Malware Distribution Platform Certificate Handler; proof: Malware
    Distribution Platform Payload Delivery


Table 3: Programs that comprise the key establishment part of the implementation and their functions.

Performance is most important for operations on the 
Malware Distribution Platform, which may have to ser- 
vice many clients in rapid succession, and in the final 
payload decryption, as it occurs in late launch with the op- 
erating system suspended. The payload decryption must 
occur per payload execution, which in our motivating sce- 
narios will be at least daily. The slowest operation on the 
Malware Distribution Platform can handle tens of clients 
per second with no optimization whatsoever. 

We provide several numbers that characterize late 
launch payload performance. The MLE setup phase of 
the Flicker kernel module involves allocation of memory 
to hold an MLE and configures MLE-related structures 
like page tables used by SINIT to measure the MLE. The 
Flicker module then launches the MLE, which in our case 
contains the Infection Payload Loader PAL. This PAL first 
decrypts the payload, which occupies most MLE execu- 
tion time for our experiments. The payload runs, the MLE 
exits, and the kernel module restores prior system state. 

The late launch environment execution can be as long 
as 3.2 s, which is long enough that an alert user might no- 
tice the system freeze (since the late launch environment 
suspends the OS) and become suspicious. Then again, 
performance variability is a hallmark of best-effort operat- 
ing systems like Linux and Windows. The rootkit control 
program can use heuristics to launch the payload when 
the platform is idle or the user is not physically present. 

Payload decryption performance is largely based on the 
speed of asymmetric decryption operations performed by 
the TPM. The use of TPM key blobs here involves two 
asymmetric decryption operations, one to allow use of 
the private portion of the key blob (which is stored in 
encrypted form), and one to use this private key for de- 
crypting an encrypted symmetric key. Symmetric AES 
decryption took less than 1% of total payload decryption 
time in all cases, and is unlikely to become more costly 
even with significant increases in payload size: We found 
that a 90 KB AES decryption with OpenSSL (36x larger 
than our largest payload), took only 650 microseconds.


6 Defenses 


We now examine defenses against the threat of using 
TPMs to cloak malware. We present multiple potential 
directions for combating this threat. In general, we find 
that there is no clear “silver bullet” and many of the pro- 
posed solutions require tradeoffs in terms of the security 
or usability of the TPM system. 


6.1 Restricting late launch code 


One possibility would be to restrict the code that can be 
used in late launch. For example, a system could im- 
plement a security layer to trap on SENTER instructions. 
With recent Intel hardware, a hypervisor could provide 
admission control, gaining control whenever SENTER is 
issued and protecting its memory via Extended Page Ta- 
ble protections. The hypervisor could enforce a range of 
policies with its access to OS and user state. For example, 
the TrustVisor [39] hypervisor likely enforces a policy to 
deny all MLEs since its goal is to implement an indepen- 
dent software-based trusted computing mechanism. 

Restricting access to the hardware TPM is one of the 
best approaches to defending against our attack, but such 
a defense is not trivial. Setup and maintenance of this 
approach may be difficult for a home or small business 
user. Use of a security layer is more plausible in an enter- 
prise or cloud computing environment. In that setting, the 
complexity centers on policy to check whether an MLE is 
permitted to execute in late launch. The most straightfor- 
ward methods are whitelisting or signing MLEs. These 
raise additional policy issues about what software state to 
hash or sign, how to revoke hashes or keys, and how to 
handle software updates. Any such system must also log 
failed attempts and delay or ban abusive users. 
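
A whitelisting policy of the kind discussed above can be sketched as follows. This is a toy model, not a hypervisor implementation: `MleAdmissionPolicy`, the requester names, and the failure threshold are hypothetical, and it ignores the harder issues named in the text (revocation, software updates, and what state to hash).

```python
import hashlib
from collections import Counter


class MleAdmissionPolicy:
    """Toy whitelist policy: allow only MLE images whose SHA-1 hash is approved."""

    def __init__(self, approved_images, max_failures=3):
        self.approved = {hashlib.sha1(img).digest() for img in approved_images}
        self.max_failures = max_failures
        self.failures = Counter()  # failed attempts per requester
        self.log = []              # (requester, hex digest) for each denied image

    def admit(self, requester: str, mle_image: bytes) -> bool:
        if self.failures[requester] >= self.max_failures:
            return False           # requester is banned after repeated failures
        digest = hashlib.sha1(mle_image).digest()
        if digest in self.approved:
            return True
        self.failures[requester] += 1
        self.log.append((requester, digest.hex()))
        return False


policy = MleAdmissionPolicy([b"trusted hypervisor loader"], max_failures=2)
print(policy.admit("alice", b"trusted hypervisor loader"))  # approved image
print(policy.admit("mallory", b"unknown MLE"))              # denied and logged
```

The design choice worth noting is that denial is keyed on the measured image rather than on the requester, so an approved user cannot launch an unapproved MLE; the per-requester counter only adds the logging and banning behavior mentioned above.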

It is possible to use other system software to control ad- 
mission to MLEs. SINIT, which itself is signed by Intel, 
could restrict admission to MLEs since all late launches 
first transfer control to SINIT. However, this would require
SINIT, which is low-level system software, to enforce
access control policy. It would most likely do this by
only allowing signed MLEs to run. There are then two op- 
tions: either MLEs must be signed by a key that is known 
to be trusted, or SINIT must also contain code for key 
management operations like retrieving, parsing, and vali- 
dating certificates. In the former case, the signing key is 
most likely to be from Intel; Intel chipsets can already ver- 
ify Intel-signed data [12]. However, this makes third party 
development more difficult; code signing is most effective 
when updates are infrequent and the signing party is the 
code developer. For late launch MLEs, it is quite possi- 
ble that neither will be the case. The latter case, having 
SINIT manage keys, is likely to be difficult to imple- 
ment, especially since SINIT cannot use OS services. 


6.2 TPM Manufacturer Cooperation


A malware analyst could defeat our attack with the co- 
operation of TPM manufacturers. Our attack uses keys 
certified to be TPM-controlled to distinguish communi- 
cation with a legitimate TPM from an analyst forging re- 
sponses from a TPM. A TPM manufacturer cooperating 
with analysts and certifying illegitimate EKs would defeat 
our attack, by allowing the analyst to create a software- 
controlled late-launch environment. However, any leak of 
a certificate for a non-hardware EK would undermine the 
security of all TPMs (or at least all TPMs of a given man- 
ufacturer). Malware analysis often occurs with the coop- 
eration of government, academic, and commercial institu- 
tions, which raises the probability of a leak. 

Alternately, a manufacturer might selectively decrypt 
data encrypted with a TPM’s public EK on-line upon re- 
quest. Such a service would compromise the Privacy CA 
protocol at the point where the Privacy CA encrypts a 
credential with the EK for a target TPM-containing plat- 
form. The EK decryption service would allow an analyst 
to obtain a credential for a forged (non-TPM-generated) 
AIK. This is less dangerous than the previous situation, 
as now only parties that trust the Privacy CA (in our case 
the Malware Distribution Platform) could be misled by
the forged AIK. However, this approach also places ad- 
ditional requirements on the manufacturer, in that it must 
respond to requests for decryption once per Malware Dis- 
tribution Platform, rather than once per analyst. Addition- 
ally, the EK decryption service has potential for abuse by 
an analyst if legitimate Privacy CAs are deployed. 


6.3 Attacks on TPM security 


Cloaking malware with the TPM relies on the security of 
TPM primitives. A compromise of one or more of these 
primitives could lead to the ability to decrypt or read an 
encrypted payload. For instance, the exclusive access of 
late launch code to system DRAM is what prevents ac- 
cess to decrypted malicious payloads. A vulnerability in 
the signed code module that implements the late launch 
mechanism (and enables this exclusive access) could allow
an analyst to read a decrypted payload [55].

Physical access to a TPM permits other attacks. Some 
TPM uses are vulnerable to a reset of the TPM without re- 
setting the entire system, by grounding a pin on the LPC 
bus [32]. Late launch, as used by our malware, is not vul- 
nerable to this attack. LPC bus messages can be eaves- 
dropped or modified [37], revealing sensitive TPM infor- 
mation. In addition, sophisticated physical deconstruc- 
tion of a TPM can expose protected secrets [51]. While 
TPMs are not specified to be resistant to physical attack, 
the tamper-resistant nature of TPM chips indicates that 
physical attacks are taken seriously. It is likely that phys- 
ical attacks will be mitigated in future TPM revisions. 

One potential analysis tool is a cold boot attack [29] 
in which memory is extracted from the machine during 
operation and read on a different machine. In practice 
the effectiveness of cold boot attacks will be tempered by 
keeping malicious computations short in duration, as it 
is only necessary to have malicious payloads decrypted 
while they are executing. Additionally, it may be possible 
to decrypt payloads in multiple stages, so only part of the
payload is decrypted in memory at any one time. Mem- 
ory capture is a serious concern for data privacy in legit- 
imate TPM-based secure computations as well. It is im- 
portant for future trusted computing solutions to address 
this issue, and the addition of mechanisms that defend 
against cold boot attacks would increase the difficulty of 
avoiding our attack. 


6.4 Restricting deployment and use of TPMs 


Our attack requires that the malware platform knows SRK 
and owner AuthData values for the TPM. The danger of 
malware using TPM functionality could be mitigated by 
careful control of AuthData. Existing software that uses 
the TPM takes some care to manage these values. For 
instance, management software used in Microsoft Win- 
dows prevents the user from storing owner AuthData on 
the same machine as the TPM. Instead, it can be saved to 
a USB key or printed in hard copy. Administrators who 
need TPM functionality would ideally understand these 
restrictions and manage these values appropriately. Aver- 
age users will be more difficult to educate. 

The malware platform could initialize a previously 
uninitialized TPM, thereby generating the initial Auth- 
Data. For our test machines, TPM initialization is pro- 
tected by a single BIOS prompt that can be presented on 
reboot at the request of system software. To prevent an in- 
experienced user from initializing a TPM at the behest of 
malicious software, manufacturers could require a more 
involved initialization process. The BIOS could require 
the user to manually enter settings to enable system soft- 
ware to assert physical presence, rather than presenting a 
single prompt. More drastically, a user could be required 
to perform some out-of-band authentication (such as calling
a computer manufacturer) to initialize the TPM. However,
all of these security features inhibit TPM usability.


6.5 Detection of malware that uses TPMs 


Traffic analysis is a common malware detection tech- 
nique. Malware that uses the TPM will cause usage pat- 
terns that might be anomalous and therefore could come 
to the attention of alert administrators. Of course detect- 
ing anomalous usage patterns is a generally difficult prob- 
lem, especially if TPM use becomes more common. 


7 Related Work 


Malware Analysis. TPM cloaking is a new method for 
frustrating static and dynamic analysis that is more pow- 
erful than previous methods because it uses hardware to 
prevent monitoring software from observing unencrypted 
code. The most effective analysis technique would be a 
variant on the cold boot attack [29], where the infected 
machine’s DRAM chips were removed during the late 
launch session. Note that a late launch session generally 
only lasts seconds. If the DRAM chips are pulled out too 
early, the payload will still be encrypted; too late and the 
payload is scrubbed out of memory. The analyst could 
also snoop the memory bus or the LPC bus. Note that 
both of these are hardware techniques, and they are both 
effective attacks against legitimate TPM use. 

Our protocol does run substantial malware outside the 
cloaked computation. All such malware is susceptible to 
static analysis [30, 47, 23], dynamic analysis [19, 58, 36], 
hybrids [24, 35], network filtering [16, 49], and network
traffic analysis [20]. To effectively use the TPM the mal- 
ware must only decrypt its important secrets within the 
cloaked computation. 

Polymorphic malware changes details of its encryption 
for each payload instance to avoid network filtering. Our 
system falls partially into the polymorphic group as we 
encrypt our payload. However dynamic analysis tech- 
niques [36] are effective against polymorphic encryption 
because such schemes must decrypt their payload during 
execution. Conficker as well as other modern malware use 
public key cryptography to validate or encrypt a malicious 
payload [43], as our cloaking protocol does. 

Trusted Computing. The TPM can be used in a vari- 
ety of contexts to provide security guarantees beyond that 
of most general-purpose processors. For instance, it can 
be used to protect encryption keys from unauthorized ac- 
cess, as in Microsoft’s BitLocker software [7], or to attest 
that the computer platform was initialized in some known 
state, as in the OSLO boot loader [32]. Flicker [40] uses 
TPM late launch functionality to provide code attestation 
for pieces of code that are instantiated by, and return to, a 
potentially untrusted operating system. Bumpy [41] uses 
late launch to protect sensitive input from potentially un- 
trusted system software. Our prototype malware platform 
uses the same functionality, adding encryption to conceal
the code payload.

Cryptography. Using cryptography for data exfiltration 
was suggested by Young and Yung [59]. Bethencourt, 
Song, and Waters [18] showed how using singly homo- 
morphic encryption one could do cryptographic exfiltra- 
tion. However, the techniques were limited to a single 
keyword search from a list of known keywords and the use 
of cryptography significantly slowed down the exfiltration 
process. Using fully homomorphic encryption [28] we
could achieve expressive exfiltration; however, the process
would be too slow to be viable in practice.


8 Conclusions 


Malware can use the Trusted Platform Module to make its 
computation significantly more difficult to analyze. Even 
though the TPM was intended to increase the security of 
computer systems, it can undermine computer security 
when used by malware. 

We explain several ways that TPM-enabled malware 
can be defeated using good engineering practice. TPMs 
will continue to be widely distributed only if they demon- 
strate value and do not bring harm. Establishing and dis- 
seminating good engineering practice for TPM manage- 
ment to both IT professionals and home users is an essen- 
tial part of the TPM’s future. 
Abstract


In recent years Internet miscreants have been leveraging 
the DNS to build malicious network infrastructures for 
malware command and control. In this paper we pro- 
pose a novel detection system called Kopis for detecting 
malware-related domain names. Kopis passively moni- 
tors DNS traffic at the upper levels of the DNS hierar- 
chy, and is able to accurately detect malware domains by 
analyzing global DNS query resolution patterns. 

Compared to previous DNS reputation systems such 
as Notos [3] and Exposure [4], which rely on monitor- 
ing traffic from local recursive DNS servers, Kopis offers 
a new vantage point and introduces new traffic features 
specifically chosen to leverage the global visibility ob- 
tained by monitoring network traffic at the upper DNS hi- 
erarchy. Unlike previous work Kopis enables DNS oper- 
ators to independently (i.e., without the need of data from 
other networks) detect malware domains within their au- 
thority, so that action can be taken to stop the abuse. 
Moreover, unlike previous work, Kopis can detect mal- 
ware domains even when no IP reputation information is 
available. 

We developed a proof-of-concept version of Kopis, 
and experimented with eight months of real-world data. 
Our experimental results show that Kopis can achieve 
high detection rates (e.g., 98.4%) and low false positive 
rates (e.g., 0.3% or 0.5%). In addition Kopis is able to 
detect new malware domains days or even weeks before 
they appear in public blacklists and security forums, and 
allowed us to discover the rise of a previously unknown 
DDoS botnet based in China. 


1 Introduction 


The Domain Name System (DNS) [17, 18] is a funda- 
mental component of the Internet. Over the years In- 
ternet miscreants have used the DNS to build malicious 
network infrastructures. For example, botnets [1, 21, 27]
and other types of malicious software make use of do- 
main names to locate their command and control (C&C) 
servers and communicate with attackers, e.g., to ex- 
filtrate stolen private information, wait for commands 
to perform attacks on other victim machines, etc. In 
response to this malicious use of DNS, static domain 
blacklists containing known malware domains have been 
used by network operators to detect DNS queries origi- 
nating from malware-infected machines and block their 
communications with the attackers [16, 19]. 


Unfortunately, the effectiveness of static domain 
blacklists is increasingly limited because there are now
an overwhelming number of new domain names appear- 
ing on the Internet every day and attackers frequently 
switch to different domains to run their malicious activi- 
ties, thus making it difficult to keep blacklists up-to-date. 


To overcome the limitations of static domain black- 
lists, we need a detection system that can dynamically 
detect new malware-related domains. This detection sys- 
tem should: 


(1) Have global visibility into DNS request and response
    messages related to large DNS zones. This enables
    “early warning”, whereby malware domains can be
    detected before the corresponding malware infections
    reach our local networks.

(2) Enable DNS operators to independently deploy the
    system and detect malware-related domains from
    within their authority zones without the need for data
    from other networks or other inter-organizational
    coordination. This enables practical, low-cost, and
    time-efficient detection and response.

(3) Accurately detect malware-related domains even in
    the absence of reputation data for the IP address
    space pointed to by the domains. IP reputation data
    is often difficult to accumulate and is fragile. This
    issue may become particularly important as IPv6 is
    deployed in the near future, due to the more expansive
    address space.

Figure 1: Overview of the levels at which Kopis, Notos,
and Exposure perform DNS monitoring. 


Recently researchers have proposed two dynamic do- 
main reputation systems, Notos [3] and Exposure [4]. 
Unfortunately, while the results reported in [3, 4] are 
promising, neither Notos nor Exposure can meet all the 
requirements stated above, as Notos and Exposure rely 
on passive monitoring of recursive DNS (RDNS) traf- 
fic. As shown in Figure 1, they monitor the DNS queries 
from a (limited) number of RDNS servers (e.g., RDNS 
3 and 4), and have only partial visibility on DNS mes- 
sages related to large DNS zones. To obtain truly global 
visibility into DNS traffic related to a given DNS zone, 
these systems need access to a very large number of 
RDNS sensors in many diverse locations. This is not 
easy to achieve in practice in part due to operational 
costs, privacy concerns related to sharing data across or- 
ganizational boundaries, and difficulties in establishing 
and maintaining trust relationships between network op- 
erators located in different countries, for example. For 
the same reasons, Notos and Exposure have not been de- 
signed to be independently deployed and run by single 
DNS operators, because they rely on data sharing among 
several networks to obtain a meaningful level of visibility 
into DNS traffic. 

On the other hand, monitoring DNS traffic from the 
upper DNS hierarchy, e.g., at top-level domain (TLD) 
server A, and authoritative name servers (AuthNSs) B 
and C, offers visibility on all DNS messages related to 
domains on which A, B, and C have authority or are a 
point of delegation. For example, assuming B is the AuthNS
for the example.com zone, monitoring the DNS
traffic at B provides visibility on all DNS messages from
all RDNS servers around the Internet that query a domain
name under the example.com zone.

Following this intuition, in this paper we propose a
novel detection system called Kopis, which takes advan- 
tage of the global visibility available at the upper lev- 
els of the DNS hierarchy to detect malware-related do- 
mains. In order for Kopis to satisfy the three require- 
ments outlined above, it needs to deal with a number 
of new challenges. Most significantly, the higher up we 
move in the DNS hierarchy, the stronger the effects of 
DNS caching [15]. As a consequence, moving up in the 
hierarchy restricts us to monitoring DNS traffic with a 
coarser granularity. For example, at the TLD level we 
will only be able to see a small subset of queries to do- 
mains under a certain delegation point due to the effects 
of the DNS cache. 

Kopis works as follows. It analyzes the streams of
DNS queries and responses at AuthNS or TLD servers
(see Figure 1) and extracts from them statistical features
such as the diversity in the network locations of the
RDNS servers that query a domain name, the level of 
“popularity” of the querying RDNS servers (defined in 
detail in Section 4), and the reputation of the IP space 
into which the domain name resolves. Given a set of 
known legitimate and known malware-related domains 
as training data, Kopis builds a statistical classification 
model that can then predict whether a new domain is 
malware-related based on observed query resolution pat- 
terns. 

Our choice of Kopis’ statistical features, which we dis- 
cuss in detail in Section 4, is determined by the nature of 
the information accessible at the upper DNS hierarchy. 
As a result these features are significantly different from 
those used by RDNS-based systems such as Notos [3] 
and Exposure [4]. In particular, we were pleasantly sur- 
prised to find that, while Notos and Exposure rely heav- 
ily on features based on IP reputation, Kopis’ features 
enabled it to accurately detect malware-related domains 
even in the absence of IP reputation information. This 
may become a significant advantage in the near future 
because the deployment of IPv6 may severely impact the 
effectiveness of current IP reputation systems due to the 
substantially larger IP address space that would need to 
be monitored. 

To summarize, we make the following contributions: 


• We developed a novel approach to detect malware-related
  domain names. Our system leverages the
  global visibility obtained by monitoring DNS traffic
  at the upper levels of the DNS hierarchy, and can
  detect malware-related domains based on DNS
  resolution patterns.

  Kopis enables DNS operators to independently (i.e.,
  without the need of data from other networks) detect
  malware domains within their scope of authority, so
  that action can be taken to stop the abuse.

• We systematically examined real-world DNS traces
  from two large AuthNSs and a country-code level
  TLD server. We performed a rigorous evaluation
  of our statistical features and identified two new
  feature families that, unlike previous work, enable
  Kopis to detect malware domains even when no IP
  reputation information is available.

• We developed a proof-of-concept version of Kopis,
  and experimented with eight months of real-world
  data. Our experimental results show that Kopis can
  achieve high detection rates (e.g., 98.4%) and low
  false positive rates (e.g., 0.3% or 0.5%). More
  significantly, Kopis was able to identify previously
  unknown malware domain names several weeks
  before they appeared in blacklists or in security
  forums. In addition, using Kopis we detected the
  rise of a previously unknown DDoS botnet based
  in China.


2 Background and Related Work 


DNS Concepts and Terminology The domain name 
space is structured like a tree. A domain name identi- 
fies a node in the tree. For example, the domain name 
F.D.B.A. identifies the path from the root “.” to a node 
F in the tree (see Figure 2(a)). The set of resource infor- 
mation associated with a particular name is composed of 
resource records (RRs) [17, 18]. The depth of a node in 
the tree is sometimes referred to as domain level. For 
example, A. is a top-level domain (TLD), B.A. is a 
second-level domain (2LD), D.B.A. is a third-level do- 
main (3LD), and so on. 
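
The notion of domain level can be illustrated with a short Python sketch. The helper name `domain_levels` is hypothetical and is not part of the system described in this paper; it only assumes that a name is written as dot-separated labels, as in the example above.

```python
from typing import List


def domain_levels(name: str) -> List[str]:
    """Return the ancestors of a fully qualified name, from the TLD downward."""
    labels = [label for label in name.rstrip(".").split(".") if label]
    # The k rightmost labels form the k-th level domain (k = 1 is the TLD).
    return [".".join(labels[-k:]) + "." for k in range(1, len(labels) + 1)]


levels = domain_levels("F.D.B.A.")
# levels[0] is the TLD, levels[1] the 2LD, levels[2] the 3LD, and so on.
```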

The information related to the domain name space is 
stored in a distributed domain name database. The do- 
main name database is partitioned by “cuts” made in the 
name space between adjacent nodes. After all cuts are 
made, each group of connected nodes represents a separate
zone [17]. Each zone has at least one node, and
hence a domain name, for which it is authoritative. For 
each zone, a node which is closer to the root than any 
other node in the zone can be identified. The name of this 
node is often used to identify the zone. The RRs of the 
nodes in a given zone are served by one or more authoritative
name servers (AuthNSs). AuthNSs that have complete
knowledge about a zone (i.e., they store the RRs for
all the nodes related to the zone in question in their zone
files) are said to have authority over that zone [17, 18].
AuthNSs will typically support one or more zones, and 
can delegate the authority over part of a (sub-)zone to 
other AuthNSs. 

Figure 2: Example of DNS tree and domain resolution
process. Panels: (a) DNS Tree, (b) Domain Resolution.

DNS queries are usually initiated by a stub resolver
on a user’s machine, which relies on a recursive DNS
resolver (RDNS) for obtaining a set of RRs owned by a
given domain name. The RDNS is responsible for directly
contacting the AuthNSs on behalf of the stub resolver
to obtain the requested information, and for returning it
to the stub resolver. The RDNS is also responsible for
caching the obtained information up to a certain period
of time, called the Time To Live (TTL), so that if the same
or another stub resolver queries again for the same
information within the TTL time window, the RDNS will
not need to contact the authoritative name servers (thus
improving efficiency). Figure 2(b) enumerates the steps
involved in a typical query resolution process, assuming
an empty cache.
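
The caching behavior described above can be sketched with a toy model. This is only an illustration under stated assumptions: the class `TTLCache`, the `upstream` callable standing in for a query to the AuthNS, and the injectable `clock` are all hypothetical names. The counter shows why a monitor placed at the AuthNS observes only a subset of the queries issued by stub resolvers.

```python
import time
from typing import Callable, Dict, List, Tuple


class TTLCache:
    """Toy model of an RDNS cache: answers are reused until their TTL expires."""

    def __init__(self,
                 upstream: Callable[[str], Tuple[List[str], float]],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._upstream = upstream    # hypothetical stand-in for asking the AuthNS
        self._clock = clock          # injectable so the example is deterministic
        self._entries: Dict[str, Tuple[List[str], float]] = {}
        self.upstream_queries = 0    # queries the AuthNS would actually observe

    def resolve(self, name: str) -> List[str]:
        now = self._clock()
        hit = self._entries.get(name)
        if hit is not None and now < hit[1]:
            return hit[0]            # served from cache; the AuthNS sees nothing
        records, ttl = self._upstream(name)
        self.upstream_queries += 1
        self._entries[name] = (records, now + ttl)
        return records
```

With a 60-second TTL, any number of lookups for the same name inside that window produces a single query at the upper hierarchy, which is the coarser granularity discussed in the Introduction.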


Related Work To the best of our knowledge, Wessels 
et al. [30] were the first to analyze DNS query data as 
seen from the upper DNS hierarchy. The authors fo- 
cused on examining the DNS caching behavior of re- 
cursive DNS servers from the point of view of AuthNS 
and TLD servers, and how different implementations of 
caching systems may affect the performance of the DNS. 

Recently, Hao et al. [13] released a report on DNS 
lookup patterns measured from the .com TLD servers.
Their preliminary analysis shows that the resolution 
patterns for malicious domain names are sometimes 
different from those observed for legitimate domains. 
While [13] only reports some preliminary measurement 
results and does not discuss how the findings may be 
leveraged for detection purposes, it does hint that a mal- 
ware detection system may be built around TLD-level 
DNS queries. We designed Kopis to do just that, namely 
monitor query streams at the upper DNS hierarchy and 
be able to detect previously unknown malware domains. 

Several studies shed light on the properties of malware
propagation and botnet lifetimes [7, 25, 29]. A common
observation across these research efforts is the inherent
diversity of a botnet’s infected population. Collins et al. [6]
introduced and quantified the notion of “network
uncleanliness” from the temporal and spatial network point
of view, showing that it is very probable to have a large
number of infected bots in the same network over an epoch.
They also suggest that this could be a direct effect of the
network policy enforced at the edge. Kopis directly uses
the intuition behind these past research efforts in the
requester diversity and requester profile statistical feature
families.

A number of research efforts can be found in the 
area of DNS blacklisting and reputation. Felegyhazi et 
al. [11] recently proposed a DNS reputation blacklisting 
methodology based on WHOIS information, while An- 
tonakakis et al. [3] and Bilge et al. [4] propose dynamic 
reputation systems based on passive RDNS monitoring. 
Our system is complementary to the above mentioned 
works. To the best of our knowledge, we are the first 
to analyze DNS query patterns at the AuthNS and TLD 
server level for the purpose of detecting domain names 
related to malware. 


3 System Overview 


Kopis monitors streams of DNS queries to and responses 
from the upper DNS hierarchy, and detects malware do- 
main names based on the observed query/response pat- 
terns. An overview of Kopis is shown in Figure 3. 

Our system divides the monitored data streams into
epochs $\{E_i\}_{i=1..m}$ (currently, an epoch is one day long).
At the end of each epoch Kopis summarizes the DNS
traffic related to a given domain name $d$ by computing
a number of statistical features, such as the diversity of
the IP addresses associated with the RDNS servers that
queried $d$, the relative volume of queries from the set of
querying RDNS servers, historic information related to
the IP space pointed to by $d$, etc. We defer a detailed
description and motivations regarding the features we
measure to Section 4. For now, it suffices to consider the
feature computation module in Figure 3 as a function
$F(d, E_i) = v_d^i$ that maps the DNS traffic in epoch $E_i$
related to $d$ into a feature vector $v_d^i$.

Kopis operates in two modes: a training mode and an
operation mode. In training mode, Kopis makes use of a
knowledge base $KB$, which consists of a set of known
malware-related and known legitimate domain names
(and related resolved IPs) for which the monitored AuthNS
and TLD servers are authoritative or a point of delegation.
Kopis’ learning module takes as input the set of
feature vectors $V_{train} = \{v_d^i\}_{i=1..m}, \forall d \in KB$, which
summarizes the query/response behavior of each domain
in the knowledge base across $m$ days. Each domain in
$KB$, and in turn each feature vector in $V_{train}$, is associated
with a label, namely legitimate or malware. We can
therefore use supervised learning techniques [5] to learn
a statistical classification model $S$ of DNS query patterns
related to legitimate and malware domains as seen from
the upper DNS hierarchy.

Figure 3: A high-level overview of Kopis.

In operation mode, Kopis monitors the streams of
DNS traffic and, at the end of each epoch $E_j$, maps each
domain $d' \notin KB$ (i.e., all unknown domains) extracted
from the query/response streams into a feature vector $v_{d'}^j$.
At this point, given a domain $d'$ the statistical classifier
$S$ (see Figure 3) assigns a label $l_{d',j}$ and a confidence
score $c(l_{d',j})$, which express whether the query/response
patterns observed for $d'$ during epoch $E_j$ resemble
either known legitimate or malware behavior, and with
what probability. In order to make a final decision about
$d'$, Kopis first gathers a series of labels and confidence
scores $S(v_{d'}^j) = \{l_{d',j}, c(l_{d',j})\}$, $j = t, .., (t+m)$, for $m$
consecutive epochs, where $t$ refers to a given starting
epoch $E_t$. Finally, Kopis computes the average confidence
score $C_M = \mathrm{avg}_j\{c(l_{d',j})\}$ for the malware labels
assigned to $d'$ by $S$ across the $m$ epochs, and an
alarm is raised if $C_M$ is greater than a threshold $\theta$.
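
The final decision step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and it assumes that the average is taken only over the epochs in which the classifier assigned the malware label, with an average of zero when there are none.

```python
from typing import Iterable, List, Tuple


def average_malware_confidence(results: Iterable[Tuple[str, float]]) -> float:
    """Average the confidence scores of the epochs labeled 'malware'.

    `results` holds one (label, confidence) pair per epoch. Averaging only
    over malware-labeled epochs is an assumption of this sketch.
    """
    scores: List[float] = [conf for label, conf in results if label == "malware"]
    return sum(scores) / len(scores) if scores else 0.0


def raise_alarm(results: Iterable[Tuple[str, float]], theta: float) -> bool:
    """Return True when the averaged malware confidence exceeds the threshold."""
    return average_malware_confidence(results) > theta
```

For instance, a domain labeled malware in two of three epochs with confidences 0.9 and 0.7 would raise an alarm for a threshold of 0.75 but not for a threshold of 0.85.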


4 Statistical Features 


In this section we describe the statistical features that
Kopis extracts from the monitored DNS traffic. For each
DNS query $q_j$ regarding a domain name $d$ and the related
DNS response $r_j$, we first translate it into a tuple
$Q_j(d) = (T_j, R_j, d, IPs_j)$, where $T_j$ identifies the epoch
in which the query/response was observed, $R_j$ is the IP
address of the machine that initiated the query $q_j$, $d$ is
the queried domain, and $IPs_j$ is the set of resolved IP
addresses as reported in the response $r_j$. It is worth noting
that since we are monitoring DNS queries and responses
from the upper DNS hierarchy, in some cases the
response may be delegated to a name server which Kopis
does not currently monitor. This is particularly relevant
to our TLD-level data feed, since most TLD servers are
delegation-only¹. In all those cases in which the response
does not carry the resolved IP addresses, we can derive
the $IPs$ set by leveraging a passive DNS database [24], or
by directly querying the delegated name server.

Given a domain name $d$ and a series of tuples
$Q_j(d)$, $j = 1, .., m$, measured during a certain epoch $E_t$
(i.e., $T_j = E_t$, $\forall j = 1, .., m$), Kopis extracts the following
groups of statistical features:


Requester Diversity (RD) This group of features aims
to characterize whether the machines (e.g., RDNS servers)
that query a given domain name are localized or are globally
distributed. In practice, given a domain $d$ and a series
of tuples $\{Q_j(d)\}_{j=1..m}$, we first map the series of
requester IP addresses $\{R_j\}_{j=1..m}$ to the BGP prefixes,
autonomous system (AS) numbers, and country codes (CC)
the IP addresses belong to. Then, we compute the
distribution of occurrence frequencies of the obtained BGP
prefixes (sometimes referred to as classless inter-domain
routing (CIDR) prefixes), the AS numbers and CCs.

For each of these three distributions we compute the
mean (three features), standard deviation (three features)
and variance (three features). Also, we consider the
absolute number of distinct IP addresses (i.e., distinct
values of $\{R_j\}_{j=1..m}$), the number of distinct BGP prefixes,
AS numbers and CCs (four features in total). Overall, we
obtain thirteen statistical features that summarize the
diversity of the machines that query a particular domain
name, as seen from an AuthNS or TLD server.
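
A sketch of how the thirteen RD features could be computed is shown below. Several details are assumptions of this sketch rather than statements from the paper: the `locate` callable is a hypothetical lookup from a requester IP to its (BGP prefix, AS number, country code), the occurrence frequencies are taken as relative frequencies, and the population (biased) estimators are used for the standard deviation and variance.

```python
from collections import Counter
from statistics import fmean, pstdev, pvariance
from typing import Callable, Dict, Iterable, Tuple

Location = Tuple[str, int, str]  # (BGP prefix, AS number, country code)


def rd_features(requesters: Iterable[str],
                locate: Callable[[str], Location]) -> Dict[str, float]:
    """Compute 13 requester-diversity features for one domain in one epoch."""
    ips = list(requesters)
    if not ips:
        raise ValueError("at least one requester IP is needed")
    located = [locate(ip) for ip in ips]  # hypothetical IP-to-location lookup
    features: Dict[str, float] = {"distinct_ip": float(len(set(ips)))}
    for idx, kind in enumerate(("bgp", "as", "cc")):
        counts = Counter(loc[idx] for loc in located)
        # Relative occurrence frequency of each prefix, AS number, or country.
        freqs = [count / len(ips) for count in counts.values()]
        features[f"{kind}_mean"] = fmean(freqs)
        features[f"{kind}_std"] = pstdev(freqs)
        features[f"{kind}_var"] = pvariance(freqs)
        features[f"distinct_{kind}"] = float(len(counts))
    return features
```

The result holds nine distribution statistics plus four distinct-value counts, matching the thirteen features enumerated above.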

The choice of the RD features is motivated by the ob- 
servation that the distribution of the machines on the In- 
ternet that query malicious domain names is on average 
different from the distribution of IP addresses that query 
legitimate domains. Semi-popular legitimate domain
names (e.g., small business or personal sites) will not
have a stable, diverse population of recursive DNS servers
or stubs that systematically try to contact them. On
the other hand, popular legitimate domain names (e.g.,
zone cuts, authoritative name servers, news/blog forums)
will demonstrate a very consistent and very diverse
pool of IP addresses looking them up on a daily basis.

Malware-related domain names will have a diverse
pool of IP addresses looking them up in a systematic way
(i.e., over multiple contiguous days). These IP addresses
are very likely to have significant network and geographical
diversity, simply because, with the exception of targeted
attacks, adversaries will not try to control or restrain
the geographical and network distribution of the machines
getting compromised by drive-by sites and other
social networking techniques. Intuitively, the diversity of
the infected population will be different over a given time
period, in comparison to that of benign domain names.

¹Delegation-only DNS servers are effectively limited to containing
NS resource records for sub-domains, but no actual data beyond their
own SOA and NS records.

Figure 4: Distribution of AS-diversity (a) and CC-diversity
(b) for malware-related and benign domains.

For example, Figure 4(a), which is derived from the 
dataset described in Section 5.3, reports the cumulative 
distribution functions (CDF) of the AS diversity of be- 
nign and malware-related domain names. In Figure 4(b) 
we can see the CDFs from the CC diversity for both 
classes in our dataset. We note that in both cases the 
benign domain names have a bimodal distribution. They 
either have low or very high diversity. On the other hand, 
the malware-related domain names cover a larger spec- 
trum of diversities based on the success of the malware 
distribution mechanisms they use. 


Requester Profile (RP) Not all query sources have
similar characteristics. Given a query tuple $Q_j(d) =
(T_j, R_j, d, IPs_j)$, the requester’s IP address $R_j$ may
represent the RDNS server of a large ISP that queries
domains on behalf of millions of clients, the RDNS of a
smaller organization (e.g., an academic network), or a
single end-user machine. We would like to distinguish
between such cases, and assign a higher weight to RDNS
servers that serve a large client population because a
larger network would typically have a larger number of
infected machines. While it is not possible to precisely
estimate the population behind an RDNS server, because
of the effects of caching [15], we approximate the
population measure as follows. Without loss of generality,
assume we monitor the DNS query/response stream for
a large AuthNS that has authority over a set of domains
$D$. Given an epoch $E_t$, we consider all query tuples
$\{Q_j(d)\}$, $\forall j, d$, seen during $E_t$. Let $R$ be the set of all
distinct requester IP addresses in the query tuples. For
each IP address $R_k \in R$, we count the number $c_{t,k}$ of
different domain names in $D$ queried by $R_k$ during $E_t$.
We then define the weight associated to a requester’s IP
address $R_k$ as $w_{t,k} = \frac{c_{t,k}}{\max_{l=1}^{|R|} c_{t,l}}$. In practice, we assign
a higher weight to requesters that query a large number
of domains in $D$.

Now that we have defined the weights $w_{t,k}$, given a
domain name $d'$ we measure its RP features as follows:

• Let $\{Q_i(d')\}_{i=1..h}$ be the set of query tuples related
  to $d'$ observed during an epoch $E_t$. Also, let $R(d')$
  be the set of all distinct requester IP addresses in
  $\{Q_i(d')\}_{i=1..h}$. For each $R_k \in R(d')$ we compute
  the count $c_{t,k}$ as previously described. Then,
  given the set $C_t(d') = \{c_{t,k}\}_k$, we compute the
  average, the biased and unbiased standard deviation²,
  and the biased and unbiased variance of the values
  in $C_t(d')$. It is worth noting that the biased and
  unbiased estimators of the standard deviation and
  variance have different values when the cardinality
  $|C_t(d')|$ is small.

• Similar to the above, for each $R_k \in R(d')$ we
  compute the count $c_{t,k}$. Afterwards, we multiply
  each count by the weight $w_{t-n,k}$ to obtain the set
  $WC_t(d') = \{c_{t,k} * w_{t-n,k}\}_k$ of weighted counts.
  It is worth noting that the weights $w_{t-n,k}$ are
  computed based on historical data about the resolver’s
  IP address collected $n$ epochs (seven days in our
  experiments) before the epoch $E_t$. We then compute
  the average, the biased and unbiased standard
  deviation, and the biased and unbiased variance of
  the values in $WC_t(d')$.
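
The RP computation described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: queries are simplified to hypothetical (requester IP, domain) pairs for one epoch, requesters with no historical weight are given weight 0.0, and the unbiased estimators are reported as 0.0 when only one requester is present.

```python
from collections import defaultdict
from statistics import fmean, pstdev, pvariance, stdev, variance
from typing import Dict, Iterable, List, Mapping, Set, Tuple

Query = Tuple[str, str]  # (requester IP, queried domain), a simplified tuple


def requester_counts(queries: Iterable[Query]) -> Dict[str, int]:
    """Count, per requester, the distinct domains queried during the epoch."""
    seen: Dict[str, Set[str]] = defaultdict(set)
    for requester, domain in queries:
        seen[requester].add(domain)
    return {requester: len(domains) for requester, domains in seen.items()}


def requester_weights(counts: Mapping[str, int]) -> Dict[str, float]:
    """Normalize each count by the largest count observed in the epoch."""
    top = max(counts.values())
    return {requester: count / top for requester, count in counts.items()}


def summarize(values: List[float]) -> Dict[str, float]:
    """Mean plus biased and unbiased standard deviation and variance."""
    single = len(values) < 2  # unbiased estimators need two or more samples
    return {
        "mean": fmean(values),
        "std_biased": pstdev(values),
        "std_unbiased": 0.0 if single else stdev(values),
        "var_biased": pvariance(values),
        "var_unbiased": 0.0 if single else variance(values),
    }


def rp_features(domain: str, epoch_queries: Iterable[Query],
                historical_weights: Mapping[str, float]) -> Dict[str, Dict[str, float]]:
    """Summaries of the plain and weighted counts of a domain's requesters."""
    queries = list(epoch_queries)
    counts = requester_counts(queries)
    requesters = {r for r, d in queries if d == domain}
    if not requesters:
        raise ValueError("the domain was not queried in this epoch")
    plain = [float(counts[r]) for r in requesters]
    # Unknown requesters get weight 0.0 (an assumption of this sketch).
    weighted = [counts[r] * historical_weights.get(r, 0.0) for r in requesters]
    return {"counts": summarize(plain), "weighted": summarize(weighted)}
```

In the described system the weights passed to `rp_features` would come from an earlier epoch, whereas the counts come from the current one.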


The RD and RP features described above aim to cap- 
ture the fact that malware-related domains tend to be 
queried from a diverse set of requesters with a higher 
weight more often than legitimate domains. An explana- 
tion for this expected difference in the requester char- 
acteristics is that malware-related domains tend to be 
queried from a large number of ISP networks, which 
usually are assigned a high weight. The reason is that 
ISP networks often offer little or no protection against 
malware-related software propagation. In addition, the 
population of machines in ISP networks is usually very 
large, and therefore the probability that a machine in the 
ISP network becomes infected by malware is very high. 
On the other hand, legitimate domains are often queried 
from both ISP networks and smaller organization net- 
works (having a smaller weight), such as enterprise net- 
works, which are usually better protected against mal- 
ware and tend to query fewer malware-related domains. 


2The biased estimator for the standard deviation of a random vari- 


able X is defined as ¢ = es w (X; — 1)?, while the unbiased 
estimator is defined as ¢ = oo wo (Xi — p)? 
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As shown in Section 5 both set of features can success- 
fully model benign and malware-related domain names. 


Resolved-IPs Reputation (IPR) This group of features
aims to describe whether, and to what extent, the
IP address space pointed to by a given domain has been
historically linked with known malicious activities, or
known legitimate services. We compute a total of nine
features as follows. Given a domain name $d$ and the set
of query tuples $\{Q_j(d)\}_{j=1..h}$ obtained during an epoch
$E_t$, we first consider the overall set of resolved IP
addresses $IPs(d,t) = \bigcup_{j=1}^{h} IPs_j$ (where $IPs_j$ is an element
of the tuple $Q_j(d)$, as explained above). Let $BGP(d,t)$
and $AS(d,t)$ be the set of distinct BGP prefixes and
autonomous system numbers to which the IP addresses in
$IPs(d,t)$ belong, respectively. We compute the following
groups of features.


• Malware Evidence: includes the average number of
known malware-related domain names that in the
past month (with respect to the epoch $E_t$) have
pointed to each of the IP addresses in IPs(d, t).
Similarly, we compute the average number of
known malware-related domains that have pointed
to each of the BGP prefixes and AS numbers in
BGP(d, t) and AS(d, t).


• SBL Evidence: much like the malware evidence
features, we compute the average number of domains
from the Spamhaus Block List [22] that in the past
have pointed to each of the IP addresses, BGP
prefixes, and AS numbers in IPs(d, t), BGP(d, t),
and AS(d, t), respectively.


• Whitelist Evidence: We compute the number of
IP addresses in IPs(d, t) that match IP addresses
pointed to by domains in the DNSWL [9]³ or
the top 30 domains according to Alexa [2].
Similarly, we compute the number of BGP prefixes in
BGP(d, t) and AS numbers in AS(d, t) that
include IP addresses pointed to by domains in DNSWL
or the top 30 Alexa domains.
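
The three evidence groups above, each measured at the IP, BGP prefix and AS
level, can be sketched as follows. This is only an illustration under stated
assumptions: the dictionary layout, the function names `avg_evidence` and
`ipr_features`, and the sample data are hypothetical, and whitelist matching
is simplified to a set intersection.

```python
from statistics import fmean

def avg_evidence(keys, history):
    """Average count of known-bad domains per key (IP, BGP prefix, or ASN)."""
    counts = [len(history.get(k, ())) for k in keys]
    return fmean(counts) if counts else 0.0

def ipr_features(resolved, malware, sbl, whitelist):
    """Nine features: three evidence types at the IP, BGP and AS levels.

    resolved, malware, sbl and whitelist are dicts keyed by "ip", "bgp", "as".
    """
    feats = {}
    for level in ("ip", "bgp", "as"):
        keys = resolved[level]
        feats[f"malware_{level}"] = avg_evidence(keys, malware[level])
        feats[f"sbl_{level}"] = avg_evidence(keys, sbl[level])
        # Whitelist evidence is a count, not an average.
        feats[f"whitelist_{level}"] = len(set(keys) & whitelist[level])
    return feats

# Hypothetical inputs using documentation-reserved addresses and ASNs.
resolved = {"ip": {"192.0.2.1", "192.0.2.2"}, "bgp": {"192.0.2.0/24"}, "as": {64500}}
malware = {
    "ip": {"192.0.2.1": {"bad1.example", "bad2.example"}},
    "bgp": {"192.0.2.0/24": {"bad1.example", "bad2.example", "bad3.example"}},
    "as": {},
}
sbl = {"ip": {}, "bgp": {}, "as": {}}
whitelist = {"ip": set(), "bgp": {"192.0.2.0/24"}, "as": set()}
features = ipr_features(resolved, malware, sbl, whitelist)
```

In a real deployment the history dictionaries would be restricted to the
lookback period the text describes (for example, the past month for the
malware evidence).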


The IPR features try to capture whether a certain do- 
main d is related to domain names and IP addresses that 
have been historically recognized as either malicious or 
legitimate domains. The intuition is that if d points into 
IP address space that is known to host lots of malicious 
activities, it is more likely that d itself is also involved in 
malicious activities. On the other hand, if d points into a 
well known, professionally run legitimate network, it is 
somewhat less likely that d is actually involved in mali- 
cious activities. 

³Domain names up to the LOW trustworthiness score, where the
LOW trustworthiness score follows the definition by DNSWL [9].
More details can be found at http://www.dnswl.org/tech.

Discussion While none of the features used alone
may allow Kopis to accurately discriminate between
malware-related and legitimate domain names, by
combining the features described above we can achieve a
high detection rate with low false positives, as shown in
Section 5.

We would like to emphasize that the features com- 
puted by Kopis, particularly the Requester Diversity and 
Requester Profile features, are novel and very differ- 
ent from the statistical features proposed in Notos [3] 
and Exposure [4], which are heavily based on IP repu- 
tation information. Unlike Notos and Exposure, which 
leverage RDNS-level DNS traffic monitoring, Kopis ex- 
tracts statistical features specifically chosen to harvest 
the “malware signal” as seen from the upper DNS hi- 
erarchy, and to cope with the coarser granularity of the 
DNS traffic observed at the AuthNS and TLD level. Fur- 
thermore, we show in Section 5 that, unlike previous 
work, Kopis is able to detect malware-related domains 
even when no IP reputation information is available. 

The Requester Diversity and Requester Profile features
can operate without any historical IP address
reputation information. These two sets of features can be
computed practically and on-the-fly at each authoritative
or TLD server. The main reason why we include the
nine Resolved-IP Reputation features is to harvest part of
the already established IP reputation in IPv4. This
helps the overall system reduce false positives (FPs)
while maintaining a very high true positive (TP) rate.
We will elaborate more in Section 5 on the different
operational modes of Kopis.


5 Evaluation 


In this section, we report the results of our evaluation of 
Kopis. First, we describe how we collected our datasets 
and the related ground truth. We then present results re- 
garding the detection accuracy of Kopis for authoritative 
NS- and TLD-level deployments. Finally, we present a 
case study regarding how Kopis was able to discover a 
previously unknown DDoS botnet based in China. 


5.1 Datasets 


Our datasets were composed of the DNS traffic obtained
from two major domain name registrars between
01-01-2010 and 08-31-2010, and from a country code
top level domain (.ca) between 08-26-2010 and
10-18-2010. In the case of the two domain name
registrars we were also able to observe the answers
returned to the requester of each resolution. Therefore, it
is easy for us to identify the IP addresses for the A-type
of DNS query traffic. In the case of the TLD we obtained
data only for 52 days and had to passively reconstruct the
IP addresses corresponding to the A-type of lookups
observed.


Figure 5: General observations from the datasets. Plot
(i) shows the difference between the raw lookup volume
vs. the query tuples that Kopis uses over a period of 107
days. Plots (ii), (iii) and (iv) show the number of unique
CCs, ASs and CIDRs (in which the RDNSs reside) for
each domain name that was looked up during one day.


An interesting problem arises when we work with the
large data volume from major authorities and the .ca
TLD servers. Over a sample monitoring period of 107
days, we can see from Figure 5 (i) that the daily number
of lookups to the authorities was on average 321 million.
This was a significant problem since it would be hard to
process such a volume of raw data, especially if the
temporal information from these daily observations were
important for the final detection process. On the same
set of raw data we used a data reduction process that
maintained only the query tuples (as defined in
Section 4). This reduced the daily observations,
as we can observe from Figure 5 (i), to a daily average
of 12,583,723 unique query tuples. The signal that we
lost with this reduction was the absolute lookup volume
of each query tuple in the raw data. Additionally, we
lost all time-sensitive information regarding the periods
within a day in which each query tuple was looked up. As
we will see in the following sections, this reduction does
not affect Kopis’ ability to model the profile of benign
and malware-related domains.
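
The data reduction described above can be sketched as a simple
de-duplication. The snippet below assumes, for illustration only, that a
query tuple consists of a day, a requester address, a domain name and the
set of resolved IPs; the authoritative definition is the one given in
Section 4, and the function name and sample data are hypothetical.

```python
def reduce_to_query_tuples(raw_lookups):
    """Collapse one day of raw lookups into a set of unique query tuples."""
    unique = set()
    for day, requester, domain, resolved_ips in raw_lookups:
        # frozenset makes the resolved-IP set hashable and order-independent.
        unique.add((day, requester, domain, frozenset(resolved_ips)))
    return unique

raw = [
    ("2010-03-20", "198.51.100.7", "a.example", ["192.0.2.1"]),
    ("2010-03-20", "198.51.100.7", "a.example", ["192.0.2.1"]),  # repeat
    ("2010-03-20", "198.51.100.9", "a.example", ["192.0.2.1"]),
    ("2010-03-20", "198.51.100.7", "b.example", ["192.0.2.2", "192.0.2.3"]),
    ("2010-03-20", "198.51.100.7", "b.example", ["192.0.2.3", "192.0.2.2"]),  # same IPs
]
tuples = reduce_to_query_tuples(raw)
```

As in the text, the reduced set no longer records how many times a tuple
was seen or at what time of day it was seen.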

Figures 5 (ii), (iii) and (iv) report the number of
CIDRs (i.e., BGP prefixes), Autonomous Systems (AS), and
Country Codes (CC), respectively, for the RDNSs (or
requesters) that looked up each domain name every day.
The domains are sorted based on counts of ASs, CCs
and CIDRs corresponding to the RDNSs that look them
up (from left to right with the leftmost having the largest
count). We observe that roughly the first 100,000
domain names were the only domains that exhibited any
diversity among the requesters that looked them up. We
can also observe that the first 10,000 domain names are
those that have some significant diversity. In particular,
only the first 10,000 domain names were looked up by
at least five CIDRs, or five ASs or two different CCs. In
other words, the remaining domains were looked up from
very few RDNSs, typically in small sets of networks and
a small number of countries. Using this observation we
created statistical vectors only for domain names in the
sets of the 100,000 most diverse domains from the point
of view of the RDNS’s CC, AS and CIDR.
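
The selection step described above can be sketched as follows. The record
layout, the function names, and in particular the choice to take the union
of the three per-dimension rankings are assumptions made for illustration;
the text does not spell out how the three sets were combined.

```python
from collections import defaultdict

def requester_diversity(records):
    """Map each domain to the distinct CIDRs, ASs and CCs of its requesters."""
    diversity = defaultdict(lambda: {"cidr": set(), "as": set(), "cc": set()})
    for domain, cidr, asn, cc in records:
        diversity[domain]["cidr"].add(cidr)
        diversity[domain]["as"].add(asn)
        diversity[domain]["cc"].add(cc)
    return diversity

def most_diverse(diversity, n):
    """Union of the top-n domains ranked separately by CIDR, AS and CC counts."""
    selected = set()
    for key in ("cidr", "as", "cc"):
        ranked = sorted(diversity, key=lambda d: len(diversity[d][key]), reverse=True)
        selected.update(ranked[:n])
    return selected

# Hypothetical requester records: (domain, requester CIDR, ASN, country code).
records = [
    ("a.example", "198.51.100.0/24", 64500, "US"),
    ("a.example", "203.0.113.0/24", 64501, "CA"),
    ("a.example", "192.0.2.0/24", 64502, "CN"),
    ("b.example", "198.51.100.0/24", 64500, "US"),
    ("c.example", "198.51.100.0/24", 64500, "US"),
    ("c.example", "203.0.113.0/24", 64501, "CA"),
]
diversity = requester_diversity(records)
kept = most_diverse(diversity, n=2)
```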


5.2 Obtaining the Ground Truth 


We collected more than eight months of DNS traffic from 
two DNS authorities and the .ca TLD. All query tuples 
derived from these DNS authorities were stored daily and 
indexed in a relational database. Due to some monitor- 
ing problems we missed traffic from 3 days in January, 9 
days in March and 6 days in June 2010. 

Some of our statistical features require us to map each
observed IP address to the related CIDR (or BGP prefix),
AS number and country code (Section 4). To this end,
we leveraged Team CYMRU’s IP-to-ASN mapping [28].

Kopis’ knowledge base contained malware information
from two malware feeds collected since March 2009.
We also collected public blacklisting information from
various publicly available services (e.g.,
Malwaredomains [16], Zeus tracker [31]). Furthermore, we
collected information regarding domain names residing in
benign networks from DNSWL [9], as well as the address
space of the top 30 Alexa [2] domains, verified with
the assistance of Dihe’s IP address index browser [8].
Overall, we were able to label 225,429 unique RRs that
correspond to 28,915 unique domain names. Of those,
1,598 domain names were labeled as legitimate and
27,317 domain names as malware-related. All
collected information was placed in a table with first and
last seen timestamps. This was important since we
computed all IPR features for day n based only on data we
had until day n. Finally, we should note that we labeled
all the data based on blacklisting and whitelisting
information collected until October 31st, 2010.
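
The point of the first-seen and last-seen timestamps is that features for
day n must not see evidence that arrived later. A minimal sketch of that
filter, with a hypothetical record layout and function name, could look
like this:

```python
from datetime import date

def known_by(knowledge_base, day):
    """Entries whose first-seen timestamp is on or before `day`."""
    return [entry for entry in knowledge_base if entry["first_seen"] <= day]

# Hypothetical knowledge-base rows, for illustration only.
kb = [
    {"domain": "bad1.example", "label": "malware",
     "first_seen": date(2010, 2, 1), "last_seen": date(2010, 3, 15)},
    {"domain": "bad2.example", "label": "malware",
     "first_seen": date(2010, 4, 10), "last_seen": date(2010, 4, 12)},
]
visible = known_by(kb, date(2010, 3, 1))
```

Only `bad1.example` is visible when computing features for March 1, which
keeps later blacklist entries from leaking into earlier feature vectors.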


5.3 Model Selection


As described in Section 3, Kopis uses a machine learning
algorithm to build a detector based on the statistical
profiles of resolution patterns of legitimate and
malware-related domains. As with any machine-learning task, it
is important to select the appropriate model and
important parameters. For Kopis, we need to identify the
minimal observation window of historic data necessary for
training. The observation window here is the number of
epochs from which we assemble the training dataset.

Figure 6: ROCs from datasets with different sizes
assembled from different time windows.

In Figure 6, we see the detection results from four 
different observation windows. The ROCs in Figure 6 
were computed using 10-fold cross validation. The clas- 
sifier that produced these results was a random forest 
(RF) classifier under a two, three, four and five day 
training window. The selection of the RF classifier was 
made using a model selection process [10], a common 
method used in the machine learning community, which 
identified the most accurate classifier that could model 
our dataset. Besides the RF, during model selection we 
also experimented with Naive Bayes, k-nearest neigh- 
bors (IBK), Support Vector Machines, MLP Neural Net- 
work and random committee (RC) classifiers [10]. The 
best detection results reported during the model selection 
were from the RF classifier. Specifically, the RF
classifier achieved a $TP_{rate}$ = 98.4% and a $FP_{rate}$ = 0.3%
using a five day observation window. When we increased
the observation window beyond the mark of five days we 
did not see a significant improvement in the detection re- 
sults. 

We should note that this parameter and model selection
methodology should be repeated every time Kopis is deployed
in a new AuthNS or TLD server because the
characteristics of the domains, and hence the resolution patterns,
may vary in different AuthNS and TLD servers, and
different patterns or profiles may best fit different parameter
values and classifiers.


5.4 Overall Detection Performance 


In order to evaluate the detection performance of Kopis
and in particular the validity and strength of its statistical
features and classification model, we conducted a
long-term experiment with five months of data. We used 150
different datasets created over a period of 155 days (first
15 days for bootstrap). These datasets were composed
using a fifteen-day sliding window with a one-day step
(i.e., two consecutive windows overlap by 14 days). We
then used 10-fold cross validation⁴ to obtain the $FP_{rates}$
and $TP_{rates}$ from every dataset. We picked three
classification algorithms, namely RF, RC, and IBK, which
performed best in the model selection process (described
in Section 5.3), because we wanted to use their detection
rates during the long-term experiment.

Figure 7: The distribution of $TP_{rate}$ for combinations of
features and feature families in comparison with Kopis’
observed detection accuracy.

Figure 8: The distribution of $FP_{rate}$ for combinations of
features and feature families in comparison with Kopis’
observed detection accuracy.

⁴To avoid overfitting our dataset we report the evaluation
results using 10-fold cross validation, which implies that 90%
of the dataset is used for training and 10% for testing in each
of the 10 folds. This technique is known [14] to yield a fair
estimation of classification performance over a dataset.

In Figure 7 and Figure 8 we observe the distribution
of the $TP_{rates}$ and $FP_{rates}$ for the RF classifier over
the entire evaluation period. The average, minimum and
maximum $FP_{rates}$ for the RF were 0.5% (8 domains),
0.2% (3 domains) and 1.1% (18 domains), respectively,
while the average, minimum and maximum $TP_{rates}$
were 99.1% (27,072 domains), 98.1% (27,071 domains)
and 99.8% (27,262 domains), respectively. The RF
classifier’s $FP_{rates}$ were almost consistently around 0.6% or
less. The $TP_{rate}$ of the RF classifier, with the exception
of six days, was above 96% and typically in the range
of 98%. With the IBK classifier being the exception, the
RF and RC classifiers had similar long-term detection
accuracy. This experiment showed that Kopis overall has
a very high $TP_{rate}$ and very low $FP_{rate}$ against all new
and previously unclassified malware-related domains.

As described in Section 4, we define three main types
of features. Next we show how Kopis would operate
if trained on datasets assembled from features of
each family, first separately and then combined. To
derive the results from the experiments, we used as
input the 150 datasets created in the previously described
long-term evaluation mode. Then, for each one of these
150 datasets, we isolated the features from the RD,
RP and IPR feature families into three additional types
of datasets. In Figure 7 and Figure 8 we present the
long-term detection rates obtained using 10-fold cross
validation of these three different types of datasets.
Additionally, we present the detection results from:


• The combination of RP and RD features (RD+RP
Features).


• The combination of RD, RP and the features
from the IPR feature family that describe the
Autonomous System properties of the IP address that
each domain name d points at (RD+RP+IPR (AS)
Features).


• The combination of all features (All Features).


The long-term $TP_{rates}$ and $FP_{rates}$ in Figure 7 and
Figure 8, respectively, show the detection accuracy
obtained from each different feature set. One may tend to
think that the IPR (IP reputation) features hold a
significantly stronger classification signal than the combination
of RD and RP features, mainly because there are many
resources that currently contribute to the quantification
and improvement of IP reputation (e.g., spam block lists,
malware analysis, dynamic DNS reputation). However,
Figure 7 and Figure 8 show that with respect to both
the $FP_{rates}$ and $TP_{rates}$, the combination of the RD and
RP sets of features performs almost equally to the IPR
features used in isolation from the remaining features.
At the same time, using all features performs much
better than using each single feature subset in isolation. This


[Plot: x-axis Days (0 to 80); series: IPR Features, RP+RD
Features, All Features, RP Features, RD Features,
RP+RD+AS(IPR).]


Figure 9: $TP_{rates}$ for different observation periods using
an 80/20 train/test dataset split.


shows that the combination of the RP and RD features 
contribute significantly to the overall classification accu- 
racy and can enable the correct classification of domains 
in environments where IP reputation is absent or in cases 
where we cannot reliably compute IP reputation features 
“on-the-fly” (e.g., in some TLD-level deployments). 


5.5 New and Previously Unclassified Domains


While the experiments described in Section 5.4 showed 
that Kopis can achieve very good overall detection accu- 
racy, we also wanted to evaluate the “real-world value” 
of Kopis, and in particular its ability to detect new and 
previously unclassified malware domains. To this end, 
we conducted a set of experiments in which we trained 
Kopis based on one month of labeled data from which 
we randomly excluded 20% of both benign and malware- 
related domains (i.e., we assumed that we did not know 
anything about these domain names during training). 
This excluded 997 benign and 4,792 malware-related 
unique, deduplicated domain names from the training 
datasets. Then we used the next three weeks of data as 
an evaluation dataset, which contained the domains ex- 
cluded from the training set mentioned above, as well as 
all other newly seen domain names. In other words, the 
classification model learned using the training data was 
not provided with any knowledge whatsoever about the 
domains in the evaluation dataset. 

Figure 10: $FP_{rates}$ for different observation periods
using an 80/20 train/test dataset split.

We then classified the domains in the evaluation
dataset, with the assistance of a Random Forest
classifier, as we already discussed in Section 3. We used a
training period of 30 consecutive days and a testing
period of m = 21 days immediately following the training
period. The detection threshold θ was set to 0.9 to obtain
a good operational trade-off between false positives and
detection rate. Our primary reasoning behind setting the
threshold θ to 0.9 was to keep the $FP_{rates}$ as low as
possible so that an operator would only have to deal with a
very small number of FPs on a daily basis. We repeated
this evaluation four times during different months within
our eight months of traffic monitoring.
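
The effect of the threshold θ on the two rates can be sketched as below.
This is an illustration only: it assumes the classifier emits a confidence
score per domain and that a domain is flagged when its score is at least θ;
the function name, the scores and the labels are hypothetical.

```python
def detection_rates(scores, labels, theta=0.9):
    """Return (TP rate, FP rate) when flagging domains with score >= theta.

    labels: True for malware-related, False for legitimate. Both classes
    are assumed to be present.
    """
    tp = sum(1 for s, y in zip(scores, labels) if y and s >= theta)
    fp = sum(1 for s, y in zip(scores, labels) if not y and s >= theta)
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    return tp / positives, fp / negatives

# Hypothetical classifier scores and ground-truth labels.
scores = [0.95, 0.92, 0.60, 0.91, 0.20, 0.40]
labels = [True, True, True, False, False, False]
strict = detection_rates(scores, labels, theta=0.9)
loose = detection_rates(scores, labels, theta=0.5)
```

With these made-up scores, lowering θ from 0.9 to 0.5 raises the TP rate
without changing the FP rate; on real data a lower threshold would
normally admit more false positives as well.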


In Figure 9 and Figure 10, we can see the results 
of these experiments. From left to right, we can see 
the evaluation on 21 days of traffic in February, March, 
May and June of 2010. We trained the system based 
on one month of traffic from January, February, March 
and May 2010, respectively. We chose these months be- 
cause we had continuous daily observations (i.e., no data 
gaps) from both training and testing datasets. As in the
long-term 10-fold evaluation, we performed the
experiments using six different datasets obtained using
different feature subsets.


We present the results in the same way as in
Section 5.4. When we used all features we observed that the
average $FP_{rate}$ was 0.53% (~ two domains), while the
average $TP_{rate}$ was 73.62% (3,528 domain names). For
the RP+RD Features and IPR Features the
average $FP_{rates}$ were 0.54% (~ two domains) and 0.79% (~
two domains), respectively, while the average $TP_{rates}$
were 69.19% (3,315 domain names) and 87.25% (4,181
domain names), respectively. The RP+RD+AS (IPR)
Features gave an average $FP_{rate}$ = 0.66% (or ~
two domain names) and an average $TP_{rate}$ = 65.05% (or
3,117 domain names).


When we used the combination of all features we see
that for the first 42 days of evaluation (February and
March of 2010) Kopis had a virtually zero $FP_{rate}$
and an average $TP_{rate}$ = 68%. In the following 42 days of
evaluation, Kopis had a better $TP_{rate}$ but with some
extra false positives, always below 0.5%. Investigating the
nature of the false positives, we observed that the domain
names responsible are related to BitTorrent services,
on-demand web-TV services and what appeared to be
online gaming sites. We suspect that the main reason why
these domains cause false positives is that the
population of similar legitimate services was insufficiently
represented during training, and therefore the RF
classifier failed to learn this behavior as being legitimate.

Figure 11: Kopis early detection results. The deltas in
days between the Kopis classification dates and the date
we received a corresponding malware sample for the
domain name.

This experiment showed that Kopis, with all features
used, can detect new and previously unclassified
domains with an average $TP_{rate}$ of 73.62% and average
$FP_{rate}$ of 0.53%. Although this is worse than the overall
detection performance reported in Section 5.4, it is
actually a good result considering that Kopis has no
knowledge of the domains in the testing dataset. It implies that
Kopis has good “real-world value”, because the ability to
detect new, previously unseen attacks is at a premium.

Figure 11 shows the difference in days between the
time that Kopis identifies a true positive domain as being
malware-related, and the day we first obtained the
malware sample associated with the malware-related domain
from our malware feed. To perform this measurement,
we used malware from a commercial malware feed with
a volume between 400 MB and 2 GB of malware samples
every day. Additionally, we used malware captured from
two corporate networks. As we can see, Kopis was able
to identify domain names on the rise even before a
corresponding malware sample was accessible to the security
community. This result shows that Kopis can give
registrars and TLD operators the ability to preemptively
block or take down malware-related domains and
remove botnets from the Internet before they become a
large security threat.


5.6 Canadian TLD 


Thus far, the experiments we have reported were all us- 
ing data available at AuthNSs. A TLD server is one level 
above AuthNS servers in the DNS hierarchy, and as such, 
it has a greater global visibility but with less granular 
data on DNS resolution behaviors. In this section we re- 
port our experiments of Kopis at the TLD level. 

We evaluated Kopis on query data obtained from the
Canadian TLD. We used the same evaluation method
introduced in Section 5.5 but with different training
window sizes, testing epochs and classification thresholds.
Before we describe the results, we should note that all
TLD traffic needs passive reconstruction of the query
data to identify the IP addresses in the A-type
resource records. We used a passive DNS database
composed of data from four ISP sensors and the passive DNS
database from SIE [24]. The Canadian TLD’s traffic was
harvested from SIE [24] (channel three).

Unfortunately, because we obtained traffic from only
52 days (2010-08-26 until 2010-10-18), we
had to use a smaller training epoch of 14 days (instead of
one month). We evaluated Kopis using the RF classifier,
14 consecutive days as the training epoch, 14 days
following the training epoch as the evaluation epoch, and
setting the threshold θ = 0.9. Two sequential training
epochs had seven days in common. The exact training
epochs were 08-27 to 09-11, 09-04 to 09-18, 09-11 to
09-25 and 09-18 to 10-02 while the corresponding
evaluation epochs were 09-12 to 09-26, 09-19 to 10-03,
09-26 to 10-10 and 10-03 to 10-17, respectively. Without
changing the data labeling methodology, we assembled
a dataset with 2,199 malware-related and 1,018 benign
unique deduplicated domain names.

In Figure 12 and Figure 13, we can see the results of
this experiment. As with the experiments in Section 5.5,
we evaluated Kopis in six modes, using as threshold
θ = 0.5. We should note here that the evaluation of the
RD+RP Features reflects the evaluation mode with
datasets that were composed only of the combination of
RD and RP features. Such a dataset can be extracted
directly from data readily available at a TLD server (in
other words, the RD+RP Features is the most
“efficient” mode that Kopis can operate in and can be
computed on the fly at a TLD server).

Figure 12: $TP_{rates}$ achieved during evaluation of traffic
obtained from the .ca TLD.

When we used all features we observed that the
average $FP_{rate}$ was 0.52% (~ six domain names),
while the average $TP_{rate}$ was 94.68% (2,082
domain names). For the RP+RD Features and IPR
Features the average $FP_{rates}$ were 3.18% (~ 33
domain names) and 0.36% (~ four domain names),
respectively, while the average $TP_{rates}$ were 63.63% (1,399
domain names) and 10.84% (238 domain names),
respectively. The RP+RD+AS (IPR) Features gave
an average $FP_{rate}$ = 1.03% (or ten domain names)
and an average $TP_{rate}$ = 78.95% (or 1,736 domain
names).

During the RP+RD Features evaluation, we
observed that the average $TP_{rate}$ reached 63.63% while
the average $FP_{rate}$ was in the range of 3.18%. These
were very promising results despite the relatively high
$FP_{rates}$ because we can operate Kopis using a sequential
classification mode, starting with RP+RD Features
followed by All Features. Kopis in this “in-series”
classification mode can achieve a good balance of
efficiency and accuracy.

More specifically, at the first step in the sequential
process, Kopis is a “coarse filter” that operates in RP+RD
Features mode with only the RP and RD statistical features
and threshold θ = 0.5. Any domain name that passes
this filter (i.e., with a “malware-related” label) then
requires additional feature computation, i.e.,
reconstructing the resolved IP address records, and further
classification at the next step in the sequential process. On
the other hand, domains that are dropped by this filter
(i.e., with a “legitimate” label) are no longer analyzed by
Kopis. Thus, the first step filter is essentially a data
reduction tool, and the sequential classification process is
a way to delay the expensive computation until the data
volume is reduced. This technique is very important at
the TLD level given the potentially huge volume of data.
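
The sequential process above can be sketched as a two-step filter. This is
not the authors' implementation: the scoring callables, their names and the
sample scores are hypothetical stand-ins for the RD+RP classifier and the
all-features classifier, and the reduction rate is taken to be one minus
the fraction of domains that pass the first step.

```python
def in_series(domains, coarse_score, full_score, theta1=0.5, theta2=0.5):
    """Two-step classification; returns (flagged domains, reduction rate)."""
    # Step 1: cheap RD+RP-style score on every domain.
    passed = [d for d in domains if coarse_score(d) >= theta1]
    # Step 2: expensive score only on what step 1 let through.
    flagged = [d for d in passed if full_score(d) >= theta2]
    reduction_rate = 1 - len(passed) / len(domains)
    return flagged, reduction_rate

# Hypothetical scores; the expensive score exists only for two domains.
coarse = {"a.example": 0.9, "b.example": 0.2, "c.example": 0.7, "d.example": 0.1}
full = {"a.example": 0.95, "c.example": 0.30}  # never computed for b or d
flagged, reduction = in_series(list(coarse), coarse.get, full.__getitem__)
```

Because `full.__getitem__` would raise `KeyError` for a domain without an
expensive score, the example also shows that the second step is never
invoked for domains dropped by the coarse filter.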

In our experiments Kopis operating at the first step
with RP+RD Features (and threshold θ = 0.5)
yielded an average data reduction rate⁵ of 87.95% on
the original dataset. After this reduction, at the second
step, we evaluated Kopis on the (remaining) dataset
using all features, and keeping the same threshold θ = 0.5.
The average $FP_{rates}$ reported at this step by Kopis were
zero while the average $TP_{rates}$ were 94.44%. The
overall $FP_{rates}$ and $TP_{rates}$ for this “in-series” mode were
zero and 60.09% (1,321 domain names), respectively.

Figure 13: $FP_{rates}$ achieved during evaluation of traffic
obtained from the .ca TLD.

⁵We define the reduction rate as follows:
$1 - \frac{TP_{malware} + FP_{malware}}{ALL}$,
where $TP_{malware}$ is the true positives for the
malware-related class, $FP_{malware}$ is the mis-classified

At this point we should note that the threshold θ was
set again with the intention to have the $FP_{rates}$ as close
to 1.0% as possible but also not to sacrifice much of the
$TP_{rate}$ produced from the first classification process in
the “in-series” mode. As we saw previously, even when
we had some FPs created by the RP+RD Features
(the first classification process in the “in-series” mode),
the combination of statistical features in the second
“in-series” step was able to prune away these FPs. An
operator may choose to lower the threshold θ even more,
which immediately increases the number of domain
names forwarded to the second “in-series”
classification process, with a potential increase in the
overall $TP_{rate}$ and $FP_{rate}$. The experiments in this
section showed that by using an “in-series” classification
process where different steps can use different (sub)sets
of features and thresholds, Kopis can achieve a good
balance of detection performance and operation efficiency
at the TLD level.
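
As a quick consistency check of the figures reported in this section, the
overall in-series TP rate can be reproduced from the per-step rates. The
snippet assumes that the second-step rate is measured only on domains that
passed the first step, so the two rates multiply; the variable names are
illustrative.

```python
# Per-step average TP rates and dataset size as reported in the text.
first_step_tp = 0.6363   # RP+RD Features, theta = 0.5
second_step_tp = 0.9444  # All Features on the reduced dataset
malware_domains = 2199   # malware-related domains in the .ca dataset

# Under the stated assumption, the overall rate is the product of the steps.
overall_tp = first_step_tp * second_step_tp
detected = round(overall_tp * malware_domains)
```

The product is about 60.09%, or 1,321 of the 2,199 malware-related domain
names, which agrees with the overall in-series result quoted above.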


5.7 DDoS Botnet Originated in China


As discussed in Section 1, Kopis was designed to have
global visibility so that it can detect domains
associated with malware activities running in an uncooperative
country or networks before the attacks propagate to
networks that it protects. In this section, we report a case
study to demonstrate Kopis’ global detection capability.

as malware-related benign domain names, and $ALL$ is all the
domain names in the evaluation dataset.

Figure 14: Various growth trends for the DDoS botnet.
Day zero is 03-20-2010.


Kopis was able to identify a commercial DDoS botnet 
in the first few weeks of its propagation in China and well 
before it began propagating within other countries, in- 
cluding the US. We alerted the security community, and 
the botnet was finally removed from the Internet in the 
middle of September 2010. Next we provide some in- 
tuition behind this discovery and why Kopis was able to 
detect this threat early. 


This DDoS botnet was controlled through 18 domain 
names, all of which were registered by the attacker under 
the same authority (although with different 2LDs). Kopis 
was deployed at the AuthNS server and was able to ob- 
serve resolution requests to these domains (even when 
the infected machines were initially not in the US) and 
classify them as malware-related because their resolution 
patterns fit the profiles of known malware domains in its 
knowledge base. 


These domain names were linked with six IP addresses 
located in the following autonomous systems: 14745 
(US), two in 4837 (CN), 37943 (CN) and two in 4134 
(CN), throughout the lifetime of the botnet. We show 
the difference between the absolute DNS lookups ver- 
sus the daily volume of unique query tuples in Figure 14 
(i). The average lookup volume every day was 438,471 
with average de-duplicated query tuples in the range of 
3,883. Despite this significant data reduction, Kopis was 
still able to track and identify this emerging threat. In 
Figures 14 (ii), (iii) and (iv), we can see the daily growth 
of unique CIDRs, AS and CCs related to the RDNSs that 
queried the domain names used in the botnet. 

Figure 15: A snapshot from the first 70 days of the
botnet’s growth with respect to the country code-based
resolution attempts for the DDoS botnet’s domain names.
Day zero is 03-20-2010.

An interesting observation can be made from
Figure 15. In this figure we can see the daily lookup volume
for the domain names of this botnet. We can immediately see
that the first big infection happened in Chinese networks
in a relatively short period of time (in the first 2-3 days).
After this initial infection, a number of machines from
several other countries were also infected but nowhere
close to the volume of the infected population in the
Chinese networks. As an example we can see in Figure 15
that the first time more than 1,000 daily lookups were
observed from the United States was more than 20 days
after the botnet was launched. Also, other countries such
as Poland and Thailand had the first infection 21 and 25
days after the botnet was launched. Furthermore, large
countries such as Italy, Spain and India reached the 100
daily lookup threshold 15 days after the start of this
botnet. Clearly, for countries like Poland and Thailand
(and even Italy, Spain and India to a large extent)
localized DNS reputation techniques would not have been able
to observe a resolution request (or a strong enough
signal) for any of the domain names related to this botnet
until the botnet had reached global scale, which was
several weeks after it was launched. Figure 16 shows the
volume of samples correlated with this botnet as they
appeared in our malware feeds. We observe that the
first malware sample related to this botnet appeared two
months after the botnet became active.


To demonstrate the contribution of each feature fam- 
ily towards the identification of the domain names that 
were part of this botnet, we conducted the following
experiment. We trained Kopis with 30 days of data before
the 5th of May 2010. Then we computed vectors for
all the domain names that were part of the botnet. We 
computed one vector every day for each domain name 
based on the information we had on the domain name and 
IP address up until that day. We classified each vector
against four trained classifiers with the following sets of
features: All Features, RD+RP Features, IPR
Features, and RD+RP+AS (IPR) Features. We
then marked the first day that each classifier detected
a domain name as malware-related, while setting the
threshold θ = 0.9. By doing so we identified the earliest
day that the classifier would have detected the domain
name without human forensic analysis of the results.
The detection results from this experiment can be
found in Table 1.
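
The marking procedure can be illustrated with a minimal Python sketch. The daily scores below are hypothetical values, not data from our experiment, and treating a score equal to the threshold as a detection is an assumption; only the 0.9 threshold comes from the text.

```python
from datetime import date

THETA = 0.9  # detection threshold stated in the text


def first_detection_day(daily_scores, theta=THETA):
    """Return the earliest date whose classifier score reaches theta.

    daily_scores maps a date to the score assigned to one domain name
    on that date. Returns None if the domain is never detected.
    Assumption: a score equal to theta counts as a detection.
    """
    for day in sorted(daily_scores):
        if daily_scores[day] >= theta:
            return day
    return None


# Hypothetical scores for a single domain name (illustrative only).
scores = {
    date(2010, 5, 18): 0.41,
    date(2010, 5, 19): 0.87,
    date(2010, 5, 20): 0.93,
    date(2010, 5, 21): 0.95,
}
print(first_detection_day(scores))
```

Counting, per classifier, the domain names whose first detection day falls on or before a given date yields one row of a table such as Table 1.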

What the results show is that only the combination of
all features detects all the domain names by the end
of August. On the other hand, the IPR and the combination
of RD+RP features detected more than half of the
domain names by the middle of July, when the botnet
was at its peak. We should also note that in the middle
of July we saw the biggest volume of malware samples 
related to the botnet’s domain names surfacing in the se- 
curity community. Finally, we should also note that these 
18 domain names appeared in public blacklists after the 
take-down of the botnet was publicly disclosed (Septem- 
ber 2010). Obviously, this was not exactly how we de- 
tected the botnet. After the initial identification of the 7 
domain names in the beginning of May and with some 
very basic forensic analysis, we managed to quickly dis- 
cover the entire corpus of the related domains. 





In an effort to compare Kopis’ early detection abilities
with recursive-based reputation systems
(like Notos and Exposure), we checked the passive DNS
database at ISC to see when these 18 domain names first
appeared. Fifteen of them never showed up in the RDNSs
that supply ISC with DNS data. The remaining three domains
appeared for the first time on the following dates:
2010-06-24 06:56:34, 2010-07-01 14:06:47 and 2010-09-08
04:32:36. This means that the first domain name
related to this botnet appeared three months after the
botnet was created, and this would have been the earliest
possible time that either Notos or Exposure could have
detected these domain names, assuming they were operating
on passive DNS data from ISC, one of the biggest
passive DNS repositories worldwide. This clearly shows
the need for detection systems like Kopis that can operate
higher in the DNS hierarchy and provide the Internet with an
early global warning system for DNS.
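
The three-month figure can be checked with a short Python calculation. It assumes that day zero from the Figure 15 caption (03-20-2010) marks the launch of the botnet; the three timestamps are the ISC first-seen dates quoted above.

```python
from datetime import date, datetime

DAY_ZERO = date(2010, 3, 20)  # day zero from the Figure 15 caption (assumed launch date)

first_seen = [
    "2010-06-24 06:56:34",
    "2010-07-01 14:06:47",
    "2010-09-08 04:32:36",
]


def days_after_launch(timestamp, day_zero=DAY_ZERO):
    """Number of whole calendar days between day zero and the first-seen date."""
    seen = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").date()
    return (seen - day_zero).days


delays = [days_after_launch(ts) for ts in first_seen]
print(delays)  # [96, 103, 172]
```

Under this assumption, the earliest appearance in the passive DNS data is 96 days after day zero, which is consistent with the three-month delay stated above.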














Features/Dates     5/20   6/1   7/15   8/31
All                   7     9     15     18
RD+RP                 3     5     12     16
IPR                   3     5     13     17
RD+RP+AS (IPR)        3     5     12     16





Table 1: Number of botnet-related domain names
that each feature family would have detected up until the
specified date, assuming that the system was operating
unsupervised.


6 Discussion 


In this section, we elaborate on possible evasion tech- 
niques and discuss some operational issues of Kopis. 


6.1 Evasion techniques 


Kopis relies significantly on the Requester Diversity
(RD) and Requester Profile (RP) features. An attacker may
attempt to dilute the information provided by the RP and
RD features to evade Kopis. This could be achieved by
resolving domain names from a diverse set of open
recursive DNS servers or even from random IPs acting as
stub resolvers (e.g., using infected machines). This will
not be as easy as it sounds, due to the RP feature family.
This is because even if the adversary looks up domain 
names from various different IP addresses, the adversary 
will still have to look up a large number of domain names 
under the same authority to make the weight of each re- 
quester large enough to alter the RP features. Addition- 
ally, the adversary will have to repeatedly (for a long 
enough period of time) ask for different domain names 
served by the same authority in order to influence/dilute 
the RDNS weighting function.

In order to be able to artificially create the necessary
signal that may dilute or even disturb the modeling of
legitimate and malware-related domain names, the adversary
would have to obtain access to traffic at the authoritative
name servers (AuthNS) or TLD servers. Furthermore, the adversary
would need a full list of the statistical feature values used
by Kopis. Such an attack would be similar in spirit to
polymorphic blending attacks [12]. We note here that reliable
and systematic access to DNS traffic at the authoritative
or TLD level is extremely hard to obtain, since it
would require the collaboration of the registrar that controls
the AuthNS or the TLD servers.

Domain name generation algorithms (DGAs) have
been used by malware families (e.g., Conficker [20],
Zeus/Murofet [23], Bobax [26], Torpig [27], etc.) in the
last few years. The seed of these DGAs typically changes
with the periodicity of a day. This implies that domain names
generated by DGAs (and under the zones Kopis monitors)
will be active only for a short period of time (e.g., a
day). Because Kopis needs a daily observation period
to provide detection results, such malware-related
domain names will potentially be inactive by the time
they are reported by our detection system. Operating
Kopis with smaller epochs (i.e., hourly granularity) could
potentially solve this problem. We leave the verification of
this operation mode to future work.


6.2 TLDs and Domain Registrars 


As we have already discussed, just observing the DNS 
resolution requests at the TLD level will not provide 
sufficient information for the system to reconstruct the 
IP addresses mapped with the queried domain names. 
There are several ways to resolve this issue. The sim- 
plest way to reconstruct the IP addresses for a given do- 
main name is to check a large passive DNS database. For 
the domains that are not replicated in the passive DNS 
database, we can use an active probing strategy to re- 
trieve the resolved IP addresses with little overhead. 
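
The fallback strategy described above can be sketched as follows. This is only an illustration: the dictionary standing in for a passive DNS database and the `probe` callable standing in for an active lookup are hypothetical stand-ins, not components of Kopis.

```python
def reconstruct_ips(domain, passive_dns, probe):
    """Return the set of IP addresses associated with a domain name.

    passive_dns: dict mapping a domain name to a set of IP strings
                 (hypothetical stand-in for a passive DNS database).
    probe:       callable taking a domain name and returning an iterable
                 of IP strings (hypothetical stand-in for active probing).
    The passive data is preferred; probing is used only when it is missing.
    """
    known = passive_dns.get(domain)
    if known:
        return set(known)
    return set(probe(domain))


# Illustrative data using documentation-only names and address ranges.
passive = {"seen.example": {"192.0.2.10", "192.0.2.11"}}


def fake_probe(domain):
    # A real probe would issue a DNS query; here the answer is fixed.
    return ["198.51.100.7"]


print(reconstruct_ips("seen.example", passive, fake_probe))
print(reconstruct_ips("unseen.example", passive, fake_probe))
```

Passing the probe in as a parameter keeps the lookup order explicit and lets the sketch run without network access.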

As a final classification heuristic, domain registrars in
particular can combine the classification results from Kopis
with domain name registration information (trivially
accessible to domain registrars). This can further reduce
FPs and also provide an additional correlation between
domain registration accounts that own domains with
suspicious resolution behavior according to Kopis.


7 Conclusion 


In this paper, we presented Kopis, a system that can operate
at the upper DNS hierarchy and detect malware-related
domains based on global DNS resolution patterns.
To the best of our knowledge, Kopis is the first system
that can operate at TLD servers and large authorities
and give DNS operators the ability to detect
malware-related domains early, even without information
about the associated malware.

Kopis models three key signals at the DNS authorities:
the daily domain name resolution patterns, the significance
of each requester for an epoch, and the domain
name’s IP address reputation. Using more than
half a year of real-world data on known benign and
malware-related domains from two major DNS authorities
and the .ca TLD, our evaluation showed that Kopis
can achieve high TP rates (98.4% against all malware-related
domains and 73.6% against new and previously
unclassified malware-related domains) and low FP rates
(0.3% and 0.5%). Kopis was also able to detect newly
created and previously unclassified malware-related do- 
main names several weeks before they were listed in any 
blacklist and before information of the associated mal- 
ware appeared in security forums. Finally, Kopis was 
used to identify the creation of a DDoS botnet in China. 
This ability to identify malware-related domains on the 
rise can provide the DNS operators the preemptive abil- 
ity to remove rapidly growing botnets at the very early 
stage, thus minimizing their threats to Internet security.


Abstract

Unsolicited bulk email (spam) is used by cyber- 
criminals to lure users into scams and to spread mal- 
ware infections. Most of these unwanted messages are 
sent by spam botnets, which are networks of compro- 
mised machines under the control of a single (malicious) 
entity. Often, these botnets are rented out to particular 
groups to carry out spam campaigns, in which similar 
mail messages are sent to a large group of Internet users 
in a short amount of time. Tracking the bot-infected hosts
that participate in spam campaigns, and attributing these 
hosts to spam botnets that are active on the Internet, are 
challenging but important tasks. In particular, this infor- 
mation can improve blacklist-based spam defenses and 
guide botnet mitigation efforts. 

In this paper, we present a novel technique to support 
the identification and tracking of bots that send spam. 
Our technique takes as input an initial set of IP addresses 
that are known to be associated with spam bots, and 
learns their spamming behavior. This initial set is then 
“magnified” by analyzing large-scale mail delivery logs 
to identify other hosts on the Internet whose behavior is 
similar to the behavior previously modeled. We imple- 
mented our technique in a tool, called BOTMAGNIFIER, 
and applied it to several data streams related to the deliv- 
ery of email traffic. Our results show that it is possible 
to identify and track a substantial number of spam bots 
by using our magnification technique. We also perform 
attribution of the identified spam hosts and track the evo- 
lution and activity of well-known spamming botnets over 
time. Moreover, we show that our results can help to im- 
prove state-of-the-art spam blacklists. 


1 Introduction 


Email spam is one of the open problems in the area of 
IT security, and has attracted a significant amount of 
research over many years [11, 26, 28, 40, 42]. Unsolicited
bulk email messages account for almost 90% of
the world-wide email traffic [20], and a lucrative business
has emerged around them [12]. The content of spam
emails lures users into scams, promises to sell cheap 
goods and pharmaceutical products, and spreads mali- 
cious software by distributing links to websites that per- 
form drive-by download attacks [24]. 


Recent studies indicate that, nowadays, about 85% of 
the overall spam traffic on the Internet is sent with the 
help of spamming botnets [20,36]. Botnets are networks 
of compromised machines under the direction of a sin- 
gle entity, the so-called botmaster. While different bot- 
nets serve different, nefarious goals, one important pur- 
pose of botnets is the distribution of spam emails. The 
reason is that botnets provide two advantages for spam- 
mers. First, a botnet serves as a convenient infrastructure 
for sending out large quantities of messages; it is essen- 
tially a large, distributed computing system with mas- 
sive bandwidth. A botmaster can send out tens of mil- 
lions of emails within a few hours using thousands of 
infected machines. Second, a botnet allows an attacker 
to evade spam filtering techniques based on the sender 
IP addresses. The reason is that the IP addresses of some 
infected machines change frequently (e.g., due to the ex- 
piration of a DHCP lease, or to the change in network 
location in the case of an infected portable computer). 
Moreover, it is easy to infect machines and recruit them 
as new members into a botnet. This means that black- 
lists need to be updated constantly by tracking the IP ad- 
dresses of spamming bots. 

Tracking spambots is challenging. One approach to 
detect infected machines is to set up spam traps. These 
are fake email addresses (i.e., addresses not associated 
with real users) that are published throughout the Inter- 
net with the purpose of attracting and collecting spam 
messages. By extracting the sender IP addresses from 
the emails received by a spam trap, it is possible to ob- 
tain a list of bot-infected machines. However, this ap- 
proach faces two main problems. First, it is likely that 
only a subset of the bots belonging to a certain botnet
will send emails to the spam trap addresses. Therefore,
the analysis of the messages collected by the spam trap 
can provide only a partial view of the activity of the bot- 
net. Second, some botnets might only target users lo- 
cated in a specific country (e.g., due to the language used 
in the email), and thus a spam trap located in a different 
country would not observe those bots. 


Other approaches to identify the hosts that are part of 
a spamming botnet are specific to particular botnets. For 
example, by taking control of the command & control 
(C&C) component of a botnet [21,26], or by analyzing 
the communication protocol used by the bots to interact 
with other components of the infrastructure [6, 15, 32], 
it is possible to enumerate (a subset of) the IP addresses 
of the hosts that are part of a botnet. However, in these 
cases, the results are specific to the particular botnet that 
is being targeted (and, typically, the type of C&C used). 


In this paper, we present a novel approach to identify 
and track spambot populations on the Internet. Our am- 
bitious goal is to track the IP addresses of all active hosts 
that belong to every spamming botnet. By active hosts, 
we mean hosts that are online and that participate in spam 
campaigns. Comprehensive tracking of the IP addresses 
belonging to spamming botnets is useful for several rea- 
sons: 


• Internet Service Providers can take countermeasures
  to prevent the bots whose IP addresses reside
  in their networks from sending out email messages.

• Organizations can clean up compromised machines
  in their networks.

• Existing blacklists and systems that analyze
  network-level features of emails can be improved
  by providing accurate information about machines
  that are currently sending out spam emails.

• By monitoring the number of bots that are part of
  different botnets, it is possible to guide and support
  mitigation efforts so that the C&C infrastructures
  of the largest, most aggressive, or fastest-growing
  botnets are targeted first.


Our approach to tracking spamming bots is based on 
the following insight: bots that belong to the same bot- 
net share the same C&C infrastructure and the same code 
base. As a result, these bots will feature similar behavior 
when sending spam [9, 40,41]. In contrast, bots belong- 
ing to different spamming botnets will typically use dif- 
ferent parameters for sending spam mails (e.g., the size 
of the target email address list, the domains or countries 
that are targeted, the spam contents, or the timing of their 
actions). More precisely, we leverage the fact that bots 
(of a particular botnet) that participate in a spam cam- 
paign share similarities in the destinations (domains) that 
they target and in the time periods they are active. Similar
to previous work [15], we consider a spam campaign
to be a set of email messages that share a substantial
amount of content and structure (e.g., a spam campaign 
might involve the distribution of messages that promote 
a specific pharmaceutical scam). 


Input datasets. At a high level, our approach takes two
datasets as input. The first dataset contains the IP ad- 
dresses of known spamming bots that are active during 
a certain time period (we call this time period the obser- 
vation period). The IP addresses are grouped by spam 
campaign. That is, IP addresses in the same group sent 
the same type of messages. We refer to these groups of 
IP addresses as seed pools. The second dataset is a log 
of email transactions carried out on the Internet during 
the same time period. This log, called the transaction 
log, contains entries that specify that, at a certain time, 
IP address C attempted to send an email message to IP
address S. The log does not need to be a complete log 
of every email transaction on the Internet (as it would be 
unfeasible to collect this information). However, as we 
will discuss later, our approach becomes more effective 
as this log becomes more comprehensive. 


Approach. In the first step of our approach, we search 
the transaction log for entries in which the sender IP ad- 
dress is one of the IP addresses in the seed pools (i.e., 
the known spambots). Then, we analyze these entries 
and generate a number of behavioral profiles that capture 
the way in which the hosts in the seed pools sent emails 
during the observation period. 

In the second step of the approach, the whole trans- 
action log is searched for patterns of behavior that are 
similar to the spambot behavior previously learned from 
the seed pools. The hosts that behave in a similar man- 
ner are flagged as possible spamming bots, and their IP 
addresses are added to the corresponding magnified pool. 

In the third and final step, heuristics are applied to re- 
duce false positives and to assign spam campaigns (and 
the IP addresses of bots) to specific botnets (e.g., Rus- 
tock [5], Cutwail [35], or MegaD [4, 6]). 

We implemented our approach in a tool, called BOTMAGNIFIER.
In order to populate our seed pools, we
used data from a large spam trap set up by an Internet 
Service Provider (ISP). Our transaction logs were con- 
structed by running a mirror for Spamhaus, a popular 
DNS-based blacklist. Note that other sources of infor- 
mation can be used to either populate the seed pools or 
to build a transaction log. As we will show, BOTMAGNI- 
FIER also works for transaction logs extracted from net- 
flow data collected from a large ISP’s backbone routers. 

BOTMAGNIFIER is executed periodically, at the end 
of each observation period. It outputs a list of the IP ad- 
dresses of all bots in the magnified pools that were found 
during the most recent period. Moreover, BOTMAGNIFIER
associates with each seed and magnified pool a label
that identifies (when possible) the name of the botnet
that carried out the corresponding spam campaign. Our 
experimental results show that our system can find a sig- 
nificant number of additional IP addresses compared to 
the seed baseline. Furthermore, BOTMAGNIFIER is able 
to detect emerging spamming botnets. As we will show, 
we identified the resurrection of the Waledac spam botnet 
during the evaluation period, demonstrating the ability of 
our technique to find new botnets. 

In summary, we provide the following contributions: 


• We developed a novel method for characterizing the
  behavior of spamming bots.

• We provide a novel technique for identifying and
  tracking spamming bot populations on the Internet,
  using a “magnification” process.

• We assigned spam campaigns to the major botnets,
  and we studied the evolution of the bot population
  of these botnets over time.

• We validated our results using ground truth collected
  from a number of C&C servers used by a
  large spamming botnet, and we demonstrated the
  applicability of our technique to real-world,
  large-scale datasets.


2 Input Datasets 


BOTMAGNIFIER requires two input datasets to track 
spambots: seed pools and a transaction log. In this sec- 
tion, we discuss how these two datasets are obtained. 


2.1 Seed Pools 


A seed pool is a set of IP addresses of hosts that, during 
the most recent observation period, participated in a spe- 
cific spam campaign. The underlying assumption is that 
the hosts whose IP addresses are in the same seed pool 
are part of the same spamming botnet, and they were in- 
structed to send a certain batch of messages (e.g., emails 
advertising cheap Viagra or replica watches). 

To generate the seed pools for the various spam cam- 
paigns, we took advantage of the information collected 
by a spam trap set up by a large US ISP. Since the email 
addresses used in this spam trap do not correspond to 
real customers, all the received emails are spam. We 
collected data from the spam trap between September 1, 
2010 and February 10, 2011, with a downtime of about 
15 days in November 2010. The spam trap collected, on
average, 924,000 spam messages from 268,000 IP ad- 
dresses every day. 


Identifying similar messages. We identify spam cam- 
paigns within this dataset by looking for similar email 
messages. More precisely, we analyze the subject lines 
of all spam messages received during the last observation
period (currently one day; see discussion below). Messages
that share a similar subject line are considered to
be part of the same campaign (during this period). 

Unfortunately, the subject lines of messages of a cer- 
tain campaign are typically not identical. In fact, most 
botnets vary the subject lines of the messages they send
to avoid detection by anti-spam systems. For exam- 
ple, some botnets put the user name of the recipient 
in the subject, or change the price of the pills be- 
ing sold in drug-related campaigns. To mitigate this 
problem, we extract templates from the actual subject 
lines. To this end, we substitute user names, email ad- 
dresses, and numbers with placeholder regular expres- 
sions. User names are recognized as tokens that are 
identical to the first part of the destination email address 
(the part to the left of the @ sign). For example, the
subject line “john, get 90% discounts!” sent
to user john@example.com becomes “\w+, get
[0-9]+% discounts!”
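
The substitution step can be sketched in Python as follows. This is a minimal illustration, not the BOTMAGNIFIER implementation: the tokenizing expression and the placeholder used for email addresses are assumptions, and the literal text is left unescaped because the template only serves as a grouping key here.

```python
import re

# One pass over the subject; alternatives are tried in this order.
TOKEN_RE = re.compile(
    r"(?P<email>[\w.+-]+@[\w-]+(?:\.[\w-]+)+)"  # email address
    r"|(?P<number>[0-9]+)"                      # run of digits
    r"|(?P<word>[^\W\d]\w*)"                    # word token
)

USER_PLACEHOLDER = r"\w+"
NUMBER_PLACEHOLDER = r"[0-9]+"
EMAIL_PLACEHOLDER = r"\S+@\S+"  # assumed; the text does not give this pattern


def subject_template(subject, recipient):
    """Replace user names, email addresses, and numbers with placeholders."""
    # The user name is the part of the recipient address left of the @ sign.
    user = recipient.split("@", 1)[0].lower()

    def substitute(match):
        if match.group("email"):
            return EMAIL_PLACEHOLDER
        if match.group("number"):
            return NUMBER_PLACEHOLDER
        word = match.group("word")
        return USER_PLACEHOLDER if word.lower() == user else word

    return TOKEN_RE.sub(substitute, subject)


print(subject_template("john, get 90% discounts!", "john@example.com"))
# \w+, get [0-9]+% discounts!
```

A single pass is used so that the digits inside an already inserted placeholder such as [0-9]+ are not substituted a second time.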

More sophisticated botnets, such as Rustock, add ran- 
dom text fetched from Wikipedia to both the email body 
and the subject line. Other botnets, such as Lethic, add 
a random word at the end of each subject. These tricks 
make it harder to group emails belonging to the same 
campaign that are sent by different bots, because differ- 
ent bots will add distinct text to each message. To handle 
this problem, we developed a set of custom rules for the 
largest spamming botnets that remove the spurious con- 
tent from the subject lines. 

Once the subjects of the messages have been trans- 
formed into templates and the spurious information has 
been removed, messages with the same template subject 
line are clustered together. This approach is less sophis- 
ticated than methods that take into account more features 
of the spam messages [22,40], but we found (by manual 
investigation) that our simple approach was very effec- 
tive for our purpose. Our approach, although sufficient, 
could be refined even further by incorporating these more 
sophisticated schemes to improve our ability to recognize 
spam campaigns. 

Once the messages are clustered, the IP addresses of 
the senders in each cluster are extracted. These sets of IP 
addresses represent the seed pools that are used as input 
to our magnification technique. 
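
A minimal Python sketch of this grouping step is shown below. It assumes that each message has already been reduced to a pair of sender IP address and subject template; the sample templates and addresses are made up for illustration.

```python
from collections import defaultdict


def build_seed_pools(messages):
    """Group sender IP addresses by subject template.

    messages: iterable of (sender_ip, template) pairs.
    Returns a dict mapping each template to the set of sender IPs
    observed for it; each set is one candidate seed pool.
    """
    pools = defaultdict(set)
    for sender_ip, template in messages:
        pools[template].add(sender_ip)
    return dict(pools)


# Hypothetical messages (documentation-only address ranges).
messages = [
    ("192.0.2.1", r"\w+, get [0-9]+% discounts!"),
    ("192.0.2.2", r"\w+, get [0-9]+% discounts!"),
    ("192.0.2.1", r"\w+, get [0-9]+% discounts!"),  # duplicate sender
    ("198.51.100.9", "replica watches"),
]
pools = build_seed_pools(messages)
print({template: sorted(ips) for template, ips in pools.items()})
```

Using sets means that a bot sending many messages of the same campaign is counted once in its pool.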


Seed pool size. During our experiments, we found that 
seed pools that contain a very small number of IP ad- 
dresses do not provide good results. The reason is that 
the behavior patterns that can be constructed from only a 
few known bot instances are not precise enough to rep- 
resent the activity of a botnet. For example, campaigns 
involving 200 unique IP addresses in the seed pool pro- 
duced, on average, magnified sets where 60% of the IP 
addresses were not listed in Spamhaus, and therefore
were likely legitimate servers. Similarly, campaigns with
a seed pool size of 500 IP addresses still produced mag- 
nified sets where 25% of the IP addresses were marked 
as legitimate by Spamhaus. For these reasons, we only 
consider those campaigns for which we have observed 
more than 1,000 unique sender IP addresses. The emails 
belonging to these campaigns account for roughly 84% 
of the overall traffic observed by our spam trap. It is interesting
to note that 8% of the overall traffic belongs
to campaigns carried out by fewer than 10 distinct IP addresses
per day. Such campaigns are carried out by dedicated
servers and abused email service providers. The
aggressive spam behavior of these servers and their lack
of geographic/IP diversity make them trivial to detect
without the need for magnification.
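
The size filter amounts to a single comparison, sketched below in Python. The function and variable names are hypothetical; only the limit of more than 1,000 unique sender IP addresses comes from the text.

```python
MIN_SEED_POOL_SIZE = 1000  # campaigns need more than this many unique sender IPs


def usable_seed_pools(pools, min_size=MIN_SEED_POOL_SIZE):
    """Keep only the seed pools with more than min_size unique IP addresses.

    pools: dict mapping a campaign identifier to a set of sender IPs.
    """
    return {campaign: ips for campaign, ips in pools.items() if len(ips) > min_size}


# Toy pools with integers standing in for IP addresses.
toy_pools = {
    "campaign-a": set(range(1500)),
    "campaign-b": set(range(1000)),  # exactly 1,000, so not "more than"
    "campaign-c": set(range(8)),
}
print(sorted(usable_seed_pools(toy_pools)))  # ['campaign-a']
```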


The lower limit on the size of seed pools has implica- 
tions for the length of the observation period. When this 
interval is too short, the seed pools are likely to be too 
small. On the other hand, many campaigns last less than 
a few hours. Thus, it is not useful to make the observa- 
tion period too long. Also, when increasing the length 
of the observation period, there is a delay introduced be- 
fore BOTMAGNIFIER can identify new spam hosts. This 
is not desirable when the output is used for improving 
spam defenses. In practice, we found that an observation 
period of one day allows us to generate sufficiently large 
seed pools from the available spam feed. To evaluate 
the impact that the choice of the analysis period might 
have on our analysis system, we looked at the length of 
100 spam campaigns, detected over a period of one day. 
The average length of these campaigns is 9 hours, with a 
standard deviation of 6 hours. Of the campaigns we ana- 
lyzed, 25 lasted less than four hours. However, only two 
of these campaigns did not generate large enough seed 
pools to be considered by BOTMAGNIFIER. On the other 
hand, 8 campaigns that lasted more than 18 hours would 
not have generated large enough seed pools if we used 
a shorter observation period. Also, by manual investi- 
gation, we found that campaigns that last more than one 
day typically reach the threshold of 1,000 IP addresses 
for their seed pool within the first day. Therefore, we be- 
lieve that the choice of an observation period of one day 
works well, given the characteristics of the transaction 
log we used. Of course, if the volume of either the seed 
pools or the transaction log increased, the observation 
period could be reduced accordingly, making the system 
more effective for real-time spam blacklisting. 


Note that it is not a problem when a spam campaign 
spans multiple observation periods. In this case, the 
bots that participate in this spam campaign and are ac- 
tive during multiple periods are simply included in mul- 
tiple seed pools (one for each observation period for this 
campaign).


2.2 Transaction Log


The transaction log is a record of email transactions car- 
ried out on the Internet during the same time period 
used for the generation of the seed pools. For the cur- 
rent version of BOTMAGNIFIER and the majority of our 
experiments, we obtained the transaction log by ana- 
lyzing the queries to a mirror of Spamhaus, a widely- 
used DNS-based blacklisting service (DNSBL). When 
an email server S is contacted by a client C that wants
to send an email message, server S contacts one of the
Spamhaus mirrors and asks whether the IP address of the
client C is a known spam host. If C is a known spammer,
the connection is rejected or the email is marked as
spam. 

Each query to Spamhaus contains the IP address of 
C. It is possible that S may not query Spamhaus di- 
rectly. In some cases, S is configured to use a local DNS 
server that forwards the query. In such cases, we would 
mistakenly consider the IP address of the DNS server 
as the mail server. However, the actual value of the IP 
address of S is not important for the subsequent analy- 
sis. It is only important to recognize when two different 
clients send email to the same server S. Thus, as long 
as emails sent to server S yield Spamhaus queries that 
always come from the same IP address, our technique is 
not affected. 

Each query generates an entry in the transaction log. 
More precisely, the entry contains a timestamp, the IP 
address of the sender of the message, and the IP address 
of the server issuing the query. Of course, by monitoring 
a single Spamhaus mirror (out of 60 deployed throughout 
the Internet), we can observe only a small fraction of the 
global email transactions. Our mirror observes roughly 
one hundred million email transactions a day, compared 
to estimates that put the number of emails sent daily at 
hundreds of billions [13]. 
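
As an illustration, a transaction log entry could be represented as follows in Python. The class and function names are hypothetical. The sketch relies on the usual IPv4 DNSBL convention (RFC 5782) in which the queried name carries the client address with its octets reversed, and the zone name dnsbl.example is a placeholder rather than a real blacklist zone.

```python
from dataclasses import dataclass
from datetime import datetime
from ipaddress import ip_address


@dataclass(frozen=True)
class Transaction:
    timestamp: datetime
    sender: str  # IP address of the client C that tried to send the email
    server: str  # IP address that issued the query (S or its DNS forwarder)


def sender_from_dnsbl_query(qname, zone):
    """Recover the client IPv4 address from a DNSBL query name.

    Example: '10.2.0.192.<zone>' refers to the address 192.0.2.10.
    """
    suffix = "." + zone
    if not qname.endswith(suffix):
        raise ValueError("query name is not under the given zone")
    octets = qname[: -len(suffix)].split(".")
    # ip_address() raises ValueError if the result is not a valid address.
    return str(ip_address(".".join(reversed(octets))))


def make_transaction(timestamp, qname, zone, querier_ip):
    """Build one log entry from a single observed DNSBL query."""
    return Transaction(
        timestamp=timestamp,
        sender=sender_from_dnsbl_query(qname, zone),
        server=str(ip_address(querier_ip)),
    )


entry = make_transaction(
    datetime(2010, 9, 1, 12, 0, 0),
    "10.2.0.192.dnsbl.example",
    "dnsbl.example",
    "198.51.100.5",
)
print(entry)
```

The server field holds whatever address issued the query, which matches the observation above that a forwarding DNS server is an acceptable stand-in for S as long as it is consistent.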

Note that even though Spamhaus is a blacklisting ser- 
vice, we do not use the information it provides about the 
blacklisted hosts to perform our analysis. Instead, we 
use the Spamhaus mirror only to collect the transaction 
logs, regardless of the fact that a sender may be a known 
spammer. In fact, other sources of information can be 
used to either populate the seed pools or to collect the 
transaction log. To demonstrate this, we also ran BOT- 
MAGNIFIER on transaction logs extracted from netflow 
data collected from a number of backbone routers of a 
large ISP. The results show that our general approach is 
still valid (see Section 6.4 for details). 


3 Characterizing Bot Behavior 


Given the two input datasets described in the previous 
section, the first step of our approach is to extract the behavior
of known spambots. To this end, the transaction
log is consulted. More precisely, for each seed pool, we 
query the transaction log to find all events that are associ- 
ated with all of the IP addresses in that seed pool (recall 
that the IP addresses in a seed pool correspond to known 
spambots). Here, an event is an entry in the transaction 
log where the known spambot is the sender of an email. 
Essentially, we extract all the instances in the transaction 
log where a known bot has sent an email. 

Once the transaction log entries associated with a seed 
pool are extracted, we analyze the destinations of the 
spam messages to characterize the bots’ behavior. That 
is, the behavior of the bots in a seed pool is characterized 
by the set of destination IP addresses that received spam 
messages. We call the set of server IP addresses targeted 
by the bots in a seed pool this pool’s target set. 

The reason for extracting a seed pool’s target set is the 
insight that bots belonging to the same botnet receive the 
same list of email addresses to spam, or, at least, a subset 
of addresses belonging to the same list. Therefore, during
their spamming activity, bots belonging to botnet A
will target the addresses contained in list $L_A$, while bots
belonging to botnet B will target destinations belonging
to list $L_B$. That is, the targets of a spam campaign
characterize the activity of a botnet.

Unfortunately, the target sets of two botnets often have 
substantial overlap. The reason is that there are many 
popular destinations (server addresses) that are targeted 
by most botnets (e.g., the email servers of Google, Yahoo,
large ISPs with many users, etc.). Therefore, we
want to derive, for each spam campaign (seed pool), the
most characterizing set of destination IP addresses. To
this end, we remove from each pool’s target set all server
IP addresses that appear in any target set belonging to
another seed pool.

More precisely, consider the seed pools
$P = p_1, p_2, \ldots, p_n$. Each pool $p_i$ stores the IP
addresses of known bots that participated in a certain
campaign: $i_1, i_2, \ldots, i_m$. In addition, consider that
the transaction log $L$ contains entries of the form
$(t, i_s, i_d)$, where $t$ is a time stamp, $i_s$ is the IP
address of the sender of an email, and $i_d$ is the IP address
of the destination server of an email. For each seed pool
$p_i$, we build this seed pool’s target set $T(p_i)$ as follows:


$$T(p_i) = \{\, i_d \mid (t, i_s, i_d) \in L \wedge i_s \in p_i \,\}. \qquad (1)$$


Then, we compute the characterizing set $C(p_i)$ of a
seed pool $p_i$ as follows:


$$C(p_i) = \{\, i_d \mid i_d \in T(p_i) \wedge i_d \notin T(p_j),\ j \neq i \,\}. \qquad (2)$$


As a result, $C(p_i)$ contains only the target addresses
that are unique (characteristic) for the destinations of
bots in seed pool $p_i$. The characterizing set $C(p_i)$ of
each pool is the input to the next step of our approach.
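
The two set definitions above can be illustrated with a minimal
sketch. It assumes the transaction log is available as
(timestamp, sender, destination) tuples and the seed pools as a
dictionary of sets; the function names, the pool names, and the
toy addresses are hypothetical and are not part of BOTMAGNIFIER
itself.

```python
from collections import Counter

def target_sets(log, seed_pools):
    """Equation 1: T(p_i) is the set of destinations contacted by senders in p_i."""
    targets = {name: set() for name in seed_pools}
    for _timestamp, sender, destination in log:
        for name, pool in seed_pools.items():
            if sender in pool:
                targets[name].add(destination)
    return targets

def characterizing_sets(targets):
    """Equation 2: keep only destinations that occur in exactly one target set."""
    occurrences = Counter(d for dests in targets.values() for d in dests)
    return {name: {d for d in dests if occurrences[d] == 1}
            for name, dests in targets.items()}

# Toy transaction log: (timestamp, sender IP, destination IP).
log = [
    (1, "192.0.2.1", "203.0.113.10"),     # popular server, hit by both pools
    (2, "192.0.2.1", "203.0.113.20"),
    (3, "198.51.100.1", "203.0.113.10"),
    (4, "198.51.100.1", "203.0.113.30"),
    (5, "192.0.2.99", "203.0.113.40"),    # sender in no seed pool: ignored
]
seed_pools = {"A": {"192.0.2.1"}, "B": {"198.51.100.1"}}

T = target_sets(log, seed_pools)
C = characterizing_sets(T)
print(C)
```

In this toy input the shared destination 203.0.113.10 is dropped
from both characterizing sets, which mirrors the removal of
popular mail servers described above.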


4 Bot Magnification


The goal of the bot magnification step is to find the IP ad- 
dresses of additional, previously-unknown bots that have 
participated in a known spam campaign. More precisely, 
the goal of this step is to search the transaction log for IP 
addresses that behave similarly to the bots in a seed pool 
$p_i$. If such matches can be found, the corresponding IP
addresses are added to the magnification set associated
with $p_i$. This means that a magnification set stores the IP
addresses of additional, previously-unknown bots. 

BOTMAGNIFIER considers an IP address $x_i$ that appears
in the transaction log $L$ as matching the behavior
of a certain seed pool $p_i$ (and, thus, belonging to that
spam campaign) if the following three conditions hold:
(i) host $x_i$ sent emails to at least $N$ destinations in the
seed pool’s target set $T(p_i)$; (ii) the host never sent an
email to a destination that does not belong to that target
set; (iii) host $x_i$ has contacted at least one destination that
is unique for seed pool $p_i$ (i.e., an address in $C(p_i)$). If
all three conditions are met, then IP address $x_i$ is added
to the magnification set $M(p_i)$ of seed pool $p_i$.

More formally, if we define $D(x_i)$ as the set of
destinations targeted by an IP address $x_i$, we have:


$$x_i \in M(p_i) \iff |D(x_i) \cap T(p_i)| \geq N \;\wedge\; D(x_i) \subseteq T(p_i) \;\wedge\; D(x_i) \cap C(p_i) \neq \emptyset. \qquad (3)$$


The intuition behind this approach is the following: 
when a host h sends a reasonably large number of emails 
to the same destinations that were targeted by a spam 
campaign and not to any other targets, there is a strong 
indication that the email activity of this host is similar to 
the bots involved in the campaign. Moreover, to assign 
a host h to at most one campaign (the one to which it is
most similar), we require that h targets at least one unique
destination of this campaign.
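
As a rough illustration of the three conditions, the following
sketch checks each sender's destination set against a target set
and a characterizing set. The function name, the argument layout,
and the choice to skip hosts that are already in the seed pool
are assumptions made for this example only.

```python
def magnification_set(destinations_by_sender, target_set, characterizing_set,
                      threshold, seed_pool=frozenset()):
    """Return the senders that satisfy conditions (i)-(iii) of Equation 3."""
    magnified = set()
    for sender, dests in destinations_by_sender.items():
        if sender in seed_pool:
            continue  # assumption: only previously-unknown hosts are added
        enough_hits = len(dests & target_set) >= threshold   # condition (i)
        only_targets = dests <= target_set                    # condition (ii)
        hits_unique = bool(dests & characterizing_set)        # condition (iii)
        if enough_hits and only_targets and hits_unique:
            magnified.add(sender)
    return magnified

T_i = {"d1", "d2", "d3", "d4"}   # target set of one seed pool
C_i = {"d4"}                     # its characterizing set
D = {
    "x1": {"d1", "d4"},          # satisfies all three conditions
    "x2": {"d1", "d2"},          # never touches C_i
    "x3": {"d1", "d4", "d9"},    # d9 lies outside T_i
    "x4": {"d4"},                # too few destinations for N = 2
}
M_i = magnification_set(D, T_i, C_i, threshold=2)
print(M_i)
```

Only x1 passes: x2 fails condition (iii), x3 fails condition (ii),
and x4 fails condition (i).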


Threshold computation. The main challenge in this
step is to determine an appropriate value for the threshold
$N$, which captures the minimum number of destination
IP addresses in $T(p_i)$ that a host must send emails to
in order to be added to the magnification set $M(p_i)$.
Setting $N$ to a value that is too low will generate too many
bot candidates, including legitimate email servers, and
the tool would generate many false positives. Setting $N$
to a value that is too high might discard many bots that
should have been included in the magnification set (that
is, the approach generates many false negatives). This
trade-off between false positives and false negatives is a
problem that appears in many security contexts, for
example, when building models for intrusion detection.

Figure 1: Quality of magnification for varying $k$ using
ten Cutwail campaigns of different sizes.

An additional, important consideration for the proper
choice of $N$ is the size of the target set $|T(p_i)|$.
Intuitively, we expect that $N$ should be larger when the size
of the target set increases. This is because a larger target
set increases the chance that a random, legitimate email
sender hits a sufficient number of targets by accident, and
hence, will be incorrectly included into the magnification
set. In contrast, bots carrying out a spam campaign that
targets only a small number of destinations are easier to
detect. The reason is that as soon as a legitimate email
sender sends an email to a server that is not in the set
targeted by the campaign, it will be immediately discarded
by our magnification algorithm. Therefore, we represent
the relationship between the threshold $N$ and the size of
the target set $|T(p_i)|$ as:


$$N = k \cdot |T(p_i)|, \quad 0 < k < 1, \qquad (4)$$


where $k$ is a parameter. Ideally, the relation between $N$
and $|T(p_i)|$ would be linear, and $k$ would have a constant
value. However, as will be clear from the discussion
below, $k$ also varies with the size of $|T(p_i)|$.

To determine a good value for $k$ and, as a consequence,
select a proper threshold $N$, we performed an analysis
based on ground truth about the actual IP addresses
involved in several spam campaigns. This information was
collected from the takedown of more than a dozen C&C
servers used by the Cutwail spam botnet. More specifically,
each server stored comprehensive records (e.g., target
email lists, bot IP addresses, etc.) about spam activities
for a number of different campaigns [35].

In particular, we applied BOTMAGNIFIER to ten Cutwail
campaigns, extracted from two different C&C
servers. We used these ten campaigns since we had a
precise view of the IP addresses of the bots that sent the
emails. For the experiment, we varied the value for $N$ in
the magnification process from 0 to 300. This analysis
yielded different magnification sets for each campaign.
Then, using our knowledge about the actual bots $B$ that
were part of each campaign, we computed the precision
$P$ and recall $R$ values for each threshold setting. Since
we want to express the quality of the magnification
process as a function of $k$, independently of the size of a
campaign, we use Equation 4 to get $k = N / |T(p_i)|$.

The precision value P(k) represents what fraction of 
the IP addresses that we obtain as candidates for the mag- 
nification set for a given k are actually among the ground 
truth IP addresses. The recall value R(k), on the other 
hand, tells us what fraction of the total bot set B is identi- 
fied. Intuitively, a low value of k will produce high R(k), 
but low P(k). When we increase k, P(k) will increase, 
but R(k) will decrease. Optimally, both precision and 
recall are high. Thus, for our analysis, we use the product
$PR(k) = P(k) \cdot R(k)$ to characterize the quality
of the magnification step. Figure 1 shows how $PR(k)$
varies for different values of $k$. As shown for each
campaign, $PR(k)$ first increases, then stays relatively level,
and then starts to decrease.
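
The quality measure can be made concrete with a small sketch.
It assumes that the candidate set produced for one value of k
and the ground-truth bot set B are both available as Python
sets; the function name and the sample identifiers are
hypothetical.

```python
def pr_quality(candidates, ground_truth):
    """Return (precision, recall, precision * recall) for one threshold setting."""
    true_positives = len(candidates & ground_truth)
    precision = true_positives / len(candidates) if candidates else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall, precision * recall

candidates = {"a", "b", "c", "d"}   # hosts proposed by the magnification step
bots = {"a", "b", "e"}              # ground-truth bot set B
P, R, PR = pr_quality(candidates, bots)
print(P, R, PR)
```

Here two of the four candidates are real bots (P = 1/2) and two
of the three real bots are found (R = 2/3), so the product is 1/3.
Repeating the computation for each threshold gives one curve of
the kind summarized in Figure 1.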

The results indicate that $k$ is not a constant, but varies
with the size of $|T(p_i)|$. In particular, small campaigns
have a higher optimal value for $k$ compared to larger
campaigns: as $|T(p_i)|$ increases, the value of $k$ slowly
decreases. To reflect this observation, we use the
following, simple way to compute $k$:


$$k = k_b + \frac{\alpha}{|T(p_i)|}, \qquad (5)$$


where $k_b$ is a constant value, $\alpha$ is a parameter, and
$|T(p_i)|$ is the number of destinations that a campaign
targeted. The parameters $k_b$ and $\alpha$ are determined so
that the quality of the magnification step $PR$ is
maximized for a given ground truth dataset. Using the
Cutwail campaigns as the dataset, this yields
$k_b = 8 \cdot 10^{-4}$ and $\alpha = 10$.
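
A short worked sketch of the threshold computation follows. It
reads the formula above as k = k_b + alpha / |T(p_i)| and plugs in
the two values reported for the Cutwail dataset; the function
names and the rounding of N to the nearest integer are
assumptions made for illustration.

```python
K_B = 8e-4    # constant term reported for the Cutwail ground truth
ALPHA = 10    # parameter reported for the Cutwail ground truth

def k_for(target_set_size, k_b=K_B, alpha=ALPHA):
    """Equation 5: k = k_b + alpha / |T(p_i)|."""
    return k_b + alpha / target_set_size

def threshold(target_set_size, k_b=K_B, alpha=ALPHA):
    """Equation 4: N = k * |T(p_i)|; rounding to an integer is an assumption."""
    return round(k_for(target_set_size, k_b, alpha) * target_set_size)

for size in (1_000, 50_000):
    print(size, k_for(size), threshold(size))
```

For a target set of 50,000 destinations this gives k = 0.001 and
N = 50, which agrees with the value quoted in Section 6.2.1, and
the smaller target set receives the larger k, as the text
describes.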

Our experimental results show that these parameter 
settings yield good results for a wide range of campaigns, 
carried out by several different botnets. This is because 
the magnification process is robust and not dependent 
on an optimal threshold selection. We found that non- 
optimal thresholds typically tend to decrease recall. That 
is, the magnification process does not find all bots that it 
could possibly detect, but false positives are limited. In 
Section 6.4, we show how the equation of $k$, with the
values we determined for parameters $k_b$ and $\alpha$, yields good
results for any campaign magnified from our Spamhaus 
dataset. We also show that the computation of k can be 
performed in the same way for different types of trans- 
action logs. To this end, we study how BOTMAGNIFIER 
can be used to analyze netflow records. 


5 Spam Attribution 


Once the magnification process has completed, we merge 
the IP addresses from the seed pool and the magnification
set to obtain a campaign set. We then apply several
heuristics to reduce false positives and to assign the dif- 
ferent campaign sets to specific botnets. Note that the 
labeling of the campaign sets does not affect the results 
of the bot magnification process. BOTMAGNIFIER could
be used in the wild for bot detection without these
attribution functionalities. Attribution is relevant only for
tracking the populations of known botnets, as we discuss in
Section 6.2.


5.1 Spambot Analysis Environment 


The goal of this phase is to understand the behavior of 
current spamming botnets. That is, we want to determine 
the types of spam messages sent by a specific botnet at 
a certain point in time. To this end, we have built an 
environment that enables us to execute bot binaries in a
controlled setup, similar to previous studies [11, 39].


Our spambot analysis environment is composed of one 
physical system hosting several virtual machines (VMs), 
each of which executes one bot binary. The VMs have 
full network access so that the bots can connect to the 
C&C server and receive spam-related configuration data, 
such as spam templates or batches of email addresses to 
which spam should be sent. However, we make sure that 
no actual spam emails are sent out by sinkholing spam 
traffic, i.e., we redirect outgoing emails to a mail server 
under our control. This server is configured to record 
the messages, without relaying them to the actual des- 
tination. We also prevent other kinds of malicious traf- 
fic (e.g., scanning or exploitation attempts) through vari- 
ous firewall rules. Some botnets (e.g., MegaD) use TCP 
port 25 for C&C traffic, and, therefore, we need to make 
sure that such bots can still access the C&C server. This 
is implemented by firewall rules that allow C&C traffic 
through, but prevent outgoing spam. Furthermore, bot- 
nets such as Rustock detect the presence of a virtual en- 
vironment and refuse to run. Such samples are executed 
on a physical machine configured with the same network 
restrictions. To study whether bots located in different 
countries show a unique behavior, we run each sample 
at two distinct locations: one analysis environment is lo- 
cated in the United States, while the other one is located 
in Europe. In our experience, this setup enables us to re- 
liably execute known spambots and observe their current 
spamming behavior. 


For this study, we analyzed the five different bot fam- 
ilies that were the most active during the time of our 
experiments: Rustock [5], Lethic, MegaD [4, 6], Cut- 
wail [35], and Waledac. We ran our samples from July 
2010 to February 2011. Some of the spambots we ran 
sent out spam emails for a limited amount of time (typi- 
cally, a couple of weeks), and then lost contact with their 
controllers. We periodically substituted such bots with
newer samples. Other bots (e.g., Rustock) were active
for most of the analysis period. 


5.2 Botnet Tags 


After monitoring the spambots in a controlled environ- 
ment, we attempt to assign botnet labels to spam emails 
found in our spam trap. Therefore, we first extract the 
subject templates from the emails that were collected in 
the analysis environment with the same technique de- 
scribed in Section 2.1. Then, we compare the subject 
templates with the emails we received in the spam trap 
during that same day. If we find a match, we tag the 
campaign set that contains the IP address of the bot that 
sent the message with the corresponding botnet name. 
Otherwise, we keep the campaign set unlabeled. 


5.3 Botnet Clustering 


As noted above, we ran five spambot families in our anal- 
ysis environment. Of course, it is possible that one of the 
monitored botnets is carrying out more campaigns than 
those observed by analyzing the emails sent by the bots 
we execute in our analysis environment. In addition, we 
are limited by the fact that we cannot run all bot binaries 
in the general case (e.g., due to newly emerging botnets 
or in cases where we do not have access to a sample), 
and, thus, we cannot collect information about such cam- 
paigns. The overall effect of this limitation is that some 
campaign sets may be left unlabeled. 

The goal of the botnet clustering phase is to determine 
whether an unlabeled campaign set belongs to one of the 
botnets we monitored. If an unlabeled campaign set can- 
not be associated with one of the existing labeled cam- 
paign sets, then we try to see if it can be merged with 
another unlabeled campaign set, which, together, might 
represent a new botnet. 

In both cases, there is a need to determine if two cam- 
paign sets are “close” enough to each other in order to be 
considered as part of the same botnet. In order to repre- 
sent the distance between campaign sets, we developed 
three metrics, namely an IP overlap metric, a destination 
distance metric, and a bot distance metric. 


IP overlap. The observation underlying the IP overlap 
metric is that two campaign sets sharing a large number 
of bots (i.e., common IP addresses) likely belong to the 
same botnet. It is important to note that infected ma- 
chines can belong to multiple botnets, as one machine 
may be infected with two distinct instances of malware. 
Another factor one needs to take into account is network 
address translation (NAT) gateways, which can poten- 
tially hide large networks behind them. As a result, the 
IP address of a NAT gateway might appear as part of 
multiple botnets. However, a host is discarded from the 
campaign set related to $p_i$ as soon as it contacts a
destination that is not in the target set (see Section 4 for a
discussion). Therefore, NAT gateways are likely to be
discarded from the candidate set early on: at some point, 
machines behind the NAT will likely hit two destinations 
that are unique to two different seed pools, and, thus, 
will be discarded from all campaign sets. This might 
not be true for small NATs, with just a few hosts behind 
them. In this case, the IP address of the gateway would 
be detected as a bot by BOTMAGNIFIER. In a real world 
scenario, this would still be useful information for the 
network administrator, who would know what malware 
has likely infected one or more of her hosts. 

Given these assumptions, we merge two campaign sets 
with a large IP overlap. More precisely, first the intersec- 
tion of the two campaign sets is computed. Then, if such 
intersection represents a sufficiently high portion of the 
IP addresses in either of the campaign sets, the two cam- 
paign sets are merged. 

The fraction of IP addresses that need to match ei- 
ther of the campaign sets to consider them to be part of 
the same botnet varies with the size of the sets for those 
campaigns. Intuitively, two small campaigns will have to 
overlap by a larger percentage than two large campaigns 
in order to be considered as part of the same botnet. This 
is done to avoid merging small campaigns together just 
based on a small number of IP addresses that might be 
caused by multiple infections or by two different spam- 
bots hiding behind a small NAT. Given a campaign c, the 
fraction of IP addresses that has to overlap with another 
campaign in order to be merged together is 


$$O_c = \frac{1}{\log_{10}(N_c)}, \qquad (6)$$


where $N_c$ is the number of hosts in the campaign set. We
selected this equation because the denominator increases
slowly with the number of bots carrying out a campaign.
Moreover, because of the use of the logarithm, the required
overlap decreases fast for small values of $N_c$, and much
more slowly for large values of it. Applying this equation,
a campaign carried out by 100 hosts will require an overlap
of 50% or more to be merged with another one, while a
campaign carried out by 10,000 hosts will only require an
overlap of 25%. When comparing two campaigns $c_1$ and
$c_2$, we require the smaller one to have an overlap of at
least $O_c$ with the larger one to consider them as being
carried out by the same botnet.
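
The overlap rule can be sketched as follows. The sketch reads the
formula as O_c = 1 / log10(N_c) and evaluates it on the smaller of
the two campaign sets, which is one interpretation of the text;
the function names and the handling of campaigns with fewer than
two hosts are assumptions.

```python
import math

def overlap_threshold(n_hosts):
    """Equation 6: O_c = 1 / log10(N_c); defined for n_hosts > 1."""
    return 1.0 / math.log10(n_hosts)

def should_merge(campaign_a, campaign_b):
    """Merge when the shared IPs cover at least O_c of the smaller campaign set."""
    small, large = sorted((campaign_a, campaign_b), key=len)
    if len(small) < 2:
        return False  # assumption: log10(1) = 0, so the threshold is undefined
    shared_fraction = len(small & large) / len(small)
    return shared_fraction >= overlap_threshold(len(small))

print(overlap_threshold(100), overlap_threshold(10_000))

small_campaign = set(range(100))         # 100 hosts, so O_c = 0.5
overlapping = set(range(40, 1_000))      # shares 60 of those hosts
mostly_disjoint = set(range(80, 1_000))  # shares only 20 of those hosts
print(should_merge(small_campaign, overlapping))
print(should_merge(small_campaign, mostly_disjoint))
```

The two printed thresholds correspond to the 50% and 25% figures
given in the text for campaigns of 100 and 10,000 hosts.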


Destination distance. This technique is an extension 
of our magnification step. We assume that bots carry- 
ing out the same campaign will target the same desti- 
nations. However, as mentioned previously, some bot- 
nets send spam only to specific countries during a given 
time frame. Leveraging this observation, it is possible to 
find out whether two campaign sets are likely carried out 
by the same botnet by observing the country distribution
of the set of destinations they targeted. More precisely,
we build a destination country vector for each campaign 
set. Each element of the destination country vector cor- 
responds to the fraction of destinations that belong to a 
specific country. We determined the country of each IP 
address using the GEOIP tool [19]. Then, for each pair of 
campaign sets, we calculate the cosine distance between 
them. 

We performed a precision versus recall analysis to de- 
velop an optimal threshold for this clustering technique. 
By precision, we mean how well this technique can dis- 
criminate between campaigns belonging to different bot- 
nets. By recall, we capture how well the technique can 
cluster together campaigns carried out by the same bot- 
net. We ran our analysis on 50 manually-labeled cam- 
paigns picked from the ones sent by the spambots in our 
analysis environment. Similarly to how we found the 
optimal value of $k$ in Section 4, we multiply precision
and recall together. We then searched for the threshold
value that maximizes this product. In our experiments,
we found that the cosine distance of the destination
country vectors is rarely lower than 0.8. This occurs
regardless of the particular country distribution of a campaign,
because there will be a significant number of bots in large
countries (e.g., the United States or India). The precision
versus recall analysis showed that 0.95 is a good thresh- 
old for this clustering technique. 
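
The country-vector comparison can be sketched as below. The
sketch assumes that each destination has already been mapped to
a country code (for example by a GeoIP lookup), and it treats the
reported measure as a cosine similarity in which higher values
mean closer campaigns; that reading is an assumption based on the
reported values (rarely below 0.8, threshold 0.95). All names and
sample data are hypothetical.

```python
import math
from collections import Counter

THRESHOLD = 0.95  # value reported in the text

def country_vector(country_codes):
    """Map per-destination country codes to {country: fraction of destinations}."""
    counts = Counter(country_codes)
    total = sum(counts.values())
    return {country: n / total for country, n in counts.items()}

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(value * v.get(country, 0.0) for country, value in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def same_botnet(u, v, threshold=THRESHOLD):
    return cosine(u, v) >= threshold

a = country_vector(["US"] * 6 + ["IN"] * 3 + ["BR"])
b = country_vector(["US"] * 5 + ["IN"] * 4 + ["BR"])
c = country_vector(["RU"] * 8 + ["US"] * 2)
print(same_botnet(a, b), same_botnet(a, c))
```

The same two helper functions apply unchanged to the bot distance
metric described next, with the vector built from the countries
of the bots instead of the countries of the destinations.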


Bot distance. This technique is similar to the destina- 
tion distance, except that it utilizes the country distribu- 
tion of the bot population of the campaign set instead of 
the location of the targeted servers. For each campaign 
set, we build a source country vector that contains the 
fraction of bots for a given country. 

The intuition behind this technique comes from the 
fact that malware frequently propagates through mali- 
cious web sites, or through legitimate web servers that 
have been compromised [24, 34]. These sites will not 
have a uniform distribution of users (e.g., a Spanish 
web site will mostly have visitors from Spanish-speaking 
countries) and, therefore, the distribution of compro- 
mised users in the world for that site will not be uniform. 
For this technique, we also performed a precision ver- 
sus recall analysis, in the same way as for the destination 
distance technique. Again, we experimentally found the 
optimal threshold to be 0.95. 


6 Evaluation 


To demonstrate the validity of our approach, we first ex- 
amined the results generated by BOTMAGNIFIER when 
magnifying the population of a large spamming botnet 
for which we have ground truth knowledge (i.e., we 
know which IP addresses belong to the botnet). Then,
we ran the system for a period of four months on a large
set of real-world data, and we successfully tracked the 
evolution of large botnets. 


6.1 Validation of the Approach 


To validate our approach, we studied a botnet for which 
we had direct data about the number and IP addresses of 
the infected machines. More precisely, in August 2010, 
we obtained access to thirteen C&C servers belonging 
to the Cutwail botnet [35]. Note that we only used nine 
of them for this evaluation, since two had already been 
used to derive the optimal value of N in Section 4, and 
two were not actively sending spam at the time of the 
takedown. As discussed before, these C&C servers con- 
tained detailed information about the infected machines 
belonging to the botnet and the spam campaigns car- 
ried out. The whole botnet was composed of 30 C&C 
servers. By analyzing the data on the C&C servers we 
had access to, we found that, during the last day of
operation, 188,159 bots contacted these nine servers. Of these,
37,914 (≈ 20%) contacted multiple servers. On average,
each server controlled 20,897 bots at the time of the
takedown, with a standard deviation of 5,478. Based on these
statistics, the servers to which we had access managed
the operations of between 29% and 37% of the entire bot- 
net. We believe the actual percentage of the botnet con- 
trolled by these servers was close to 30%, since all the 
servers except one were contacted by more than 19,000 
bots during the last day of operation. Only a single server 
was controlling fewer than 10,000 bots. Therefore, it is
safe to assume that the vast majority of the command
and control servers were controlling a similar number of
bots (≈ 20,000 each).

We ran the validation experiment for the period be- 
tween July 28 and August 16, 2010. For each of the 18 
days, we first selected a subset of the IP addresses refer- 
enced by the nine C&C servers. As a second step, with 
the help of the spam trap, we identified which campaigns 
had been carried out by these IP addresses during that day.
Then, we generated seed and magnified pools. Finally, 
we compared the output magnification sets against the 
ground truth (i.e., the other IP addresses referenced by 
the C&C servers) to assess the quality of the results. 

Overall, BOTMAGNIFIER identified 144,317 IP ad- 
dresses as Cutwail candidates in the campaign set. Of 
these, 33,550 (≈ 23%) were actually listed in the C&C
servers’ databases as bots. This percentage is close to 
the fraction of the botnet we had access to (since we con- 
sidered 9 out of 30 C&C servers), and, thus, this result 
suggests that the magnified population identified by our 
system is consistent. To perform a more precise analy- 
sis, we ran BOTMAGNIFIER and studied the magnified 
pools that were given as an output on a daily basis. The 
average size of the magnified pools was 4,098 per day.
In total, during the 18 days of the experiment, we grew
the bot population by 73,772 IP addresses. Of the IP ad- 
dresses detected by our tool, 17,288 also appeared in the 
spam trap during at least one other day of our experiment, 
sending emails belonging to the same campaigns carried 
out by the C&C servers. This confirms that they were 
actually Cutwail bots. In particular, 3,381 of them were 
detected by BOTMAGNIFIER before they ever appeared 
in the spam trap, which demonstrates that we can use our 
system to detect bots before they even hit our spam trap. 


For further validation, we checked our results against 
the Spamhaus database, to see if the IP addresses we 
identified as bots were listed as known spammers or not. 
81% were listed in the blacklist. 


We then tried to evaluate how many of the remaining 
27,421 IP addresses were false positives. To do this, we 
used two techniques. First, we tried to connect to the 
host to check whether it was a legitimate server. Legit- 
imate SMTP or DNS servers can show up in queries on 
Spamhaus due to several reasons (e.g., in cases where 
reputation services collect information about sender IP 
addresses or if an email server is configured to query 
the local DNS server). Therefore, we tried to determine 
if an IP address that was not blacklisted at the time of 
the experiment was a legitimate email or DNS server by 
connecting to port 25 TCP and 53 UDP. If the server re- 
sponded, we considered it to be a false positive. Unfor- 
tunately, due to firewall rules, NAT gateways, or network 
policies, some servers might not respond to our probes. 
For this reason, as a second technique, we executed a 
reverse DNS lookup on the IP addresses, looking for ev- 
idence showing that the host was a legitimate server. In 
particular, we looked for strings that are typical for mail 
servers in the hostname. These strings are smtp, mail, 
mx, post, and mta. We built this list by manually looking
at the reverse DNS lookups of the IP addresses that
were not blacklisted by Spamhaus. If the reverse lookup
matched one of these strings, we considered the IP
address as a legitimate server, i.e., a false positive. In total,
2,845 IP addresses turned out to be legitimate servers (1,712
SMTP servers and 1,431 DNS servers), which is 3.8% of
the overall magnified population.
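
The reverse-DNS heuristic can be sketched as a simple string
check. The sketch assumes that a hostname has already been
obtained for each IP address (for instance with Python's
socket.gethostbyaddr) and that case-insensitive substring
matching is what "looking for strings" means; both points, as
well as the function name and sample hostnames, are assumptions.

```python
MAIL_HINTS = ("smtp", "mail", "mx", "post", "mta")  # strings listed in the text

def looks_like_mail_server(hostname):
    """True if the reverse-DNS name contains one of the mail-server strings."""
    name = hostname.lower()
    return any(hint in name for hint in MAIL_HINTS)

for host in ("mx1.example.org", "SMTP-out.example.com",
             "dsl-203-0-113-7.dyn.example.net"):
    print(host, looks_like_mail_server(host))
```

Plain substring matching is deliberately coarse and can also
flag unrelated names that happen to contain one of the strings,
so it is best read as a conservative filter that errs on the side
of counting a host as a legitimate server.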


We then tried to determine what coverage of the en- 
tire Cutwail botnet our approach produced. Based on the 
number of active IP addresses per day we saw on the 
C&C servers, we estimated that the size of the botnet 
at the time of the takedown was between 300,000 and 
400,000 bots. This means that, during our experiment, 
we were able to track between 35 and 48 percent of the 
botnet. Given the limitations of our transaction log (see 
Section 6.2.1), this is a good result, which could be im- 
proved by getting access to multiple Spamhaus servers 
or more complete data streams.


6.2 Tracking Bot Populations


To demonstrate the practical feasibility of our approach, 
we used BOTMAGNIFIER to track bot populations in the 
wild for a period of four months. In particular, we ran 
the system for 114 days between September 28, 2010 
and February 5, 2011. We had a downtime of about 15 
days in November 2010, during which the emails of the
spam trap could not be delivered. 

By using our magnification algorithm, our system 
identified and tracked 2,031,110 bot IP addresses dur- 
ing the evaluation period. Of these, 925,978 IP addresses 
(≈ 45.6%) belonged to magnification sets (i.e., they were
generated by the magnification process), while 1,105,132 
belonged to seed pools generated with the help of the 
spam trap. 


6.2.1 Data Streams Limitations 


The limited view we have from the transaction log gen- 
erated by only one DNSBL mirror limits the number of 
bots we can track each day. BOTMAGNIFIER requires 
an IP address to appear a minimum number of times in 
the transaction log, in order to be considered as a po- 
tential bot. From our DNSBL mirror, we observed that 
a medium-size campaign targets about 50,000 different
destination servers (i.e., $|T(p_i)| = 50{,}000$). The value
of $N$ for such a campaign, calculated using Equation 5,
is 50. On an average day, our DNSBL mirror logs activity
performed by approximately 4.7 million mail senders.
Of these, only about 530,000 (≈ 11%) appear at least 50
times. Thus, we have to discard a large number of po- 
tential bots a priori, because of the limited number of 
transactions our Spamhaus mirror observes. If we had 
access to more transaction logs, our visibility would in- 
crease, and, thus, the results would improve accordingly. 
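
The a priori filter described above can be sketched as a count of
log entries per sender. Treating "appear" as the number of
transaction-log entries is an assumption, as are the function
name and the toy log; the magnification step itself additionally
requires distinct destinations inside the target set.

```python
from collections import Counter

def eligible_senders(log, min_events):
    """Return senders that occur in at least min_events transaction-log entries."""
    counts = Counter(sender for _timestamp, sender, _destination in log)
    return {sender for sender, n in counts.items() if n >= min_events}

# Toy log: sender "s1" appears three times, sender "s2" once.
log = [(1, "s1", "d1"), (2, "s1", "d2"), (3, "s2", "d1"), (4, "s1", "d3")]
print(eligible_senders(log, min_events=3))

# Share of senders that survive the filter on an average day, per the text.
print(round(530_000 / 4_700_000, 3))
```

With the figures quoted above, roughly 11% of the daily senders
remain candidates, which is why a single DNSBL mirror limits the
number of bots that can be tracked.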





6.2.2 Overview of Tracking Results 


For each day of analysis, BOTMAGNIFIER identified 
the largest spam campaigns active during that day (Sec- 
tion 2), learned the behavior of a subset of IP addresses 
carrying out those campaigns (Section 3), and grew a 
population of IP addresses behaving in the same way 
(Section 4). This provided us with the ability to track 
the population of the largest botnets, monitoring how ac- 
tive they were, and determining which periods they were 
silent. 

A challenging aspect of tracking botnets with
BOTMAGNIFIER has been assigning the right label to the
various spam campaigns (i.e., the name of the botnet that
generated them). Tagging the campaigns that we ob- 
served in our honeypot environment was trivial, while for 
the others we used the clustering techniques described in 
Section 5. In total, we observed 1,475 spam campaigns. 
We tried to assign a botnet label to each cluster, and
every time two clusters were assigned the same label, we
merged them together. After this process, we obtained
38 clusters. Seven of them were large botnets, which 
generated 50,000 or more bot IP addresses in our magni- 
fication results. The others were either smaller botnets, 
campaigns carried out by dedicated servers (i.e., not car- 
ried out by botnets), or errors produced by the clustering 
process. 

We could not assign a cluster to 107 campaigns (≈ 7%
of all campaigns), and we magnified these campaigns
independently from the others. Altogether, the
magnified sets of these campaigns accounted for 20,675 IP
addresses (≈ 2% of the total magnified hosts). We then
studied the evolution over time and the spamming capa- 
bilities of the botnets we were able to label. 


6.2.3 Analysis of Magnification Results 


Table 1 shows some results from our tracking. For each
botnet, we list the number of IP addresses we obtained 
from the magnification process. Interestingly, Lethic, 
with 887,852 IP addresses, was the largest botnet we 
found. This result is in contrast with the common be- 
lief in the security community that, at the time of our 
experiment, Rustock was the largest botnet [18]. How- 
ever, from our observation, Rustock bots appeared to be 
more aggressive in spamming than the Lethic bots. In 
fact, each Rustock bot appeared, on average, 173 times 
per day on our DNSBL mirror logs, whereas each Lethic 
bot showed up only 101 times. 

For each botnet population we grew, we distinguished 
between static and dynamic IP addresses. We considered 
an IP address as dynamic if, during the testing period, we
observed that IP address only once. On the other hand, if
we observed the same IP address multiple times, we
considered it as static. The fraction of static IP addresses
among all addresses of a botnet ranges from 15%
for Rustock to 4% for MegaD. Note that smaller botnets
exceeded the campaign size thresholds required by BOT- 
MAGNIFIER (see Section 5) less often than larger bot- 
nets, and therefore it is possible that our system under- 
estimates the number of IP addresses belonging to the 
MegaD and Waledac botnets. 
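
The static versus dynamic split can be sketched as a count of
observations per IP address. Counting distinct days is an
assumption about what "observed only once" means, and the
function name and sample data are hypothetical.

```python
from collections import Counter

def classify_ips(observations):
    """Split IPs into (dynamic, static) from an iterable of (day, ip) pairs."""
    # De-duplicate first so several events on one day count as one observation.
    days_seen = Counter(ip for _day, ip in set(observations))
    dynamic = {ip for ip, n in days_seen.items() if n == 1}
    static = {ip for ip, n in days_seen.items() if n > 1}
    return dynamic, static

observations = [
    ("2010-10-01", "a"), ("2010-10-01", "a"), ("2010-10-02", "a"),
    ("2010-10-01", "b"),
    ("2010-10-03", "c"), ("2010-10-03", "c"),
]
dynamic, static = classify_ips(observations)
print(sorted(dynamic), sorted(static))

# Static share computed from the Table 1 figures for Rustock and MegaD.
print(round(104_460 / 676_905, 3), round(3_055 / 68_117, 3))
```

The two shares printed at the end, about 15.4% and 4.5%, match
the 15% and 4% quoted above for Rustock and MegaD.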

Figures 2(a) and 2(b) show the growth of IP addresses 
over time for the magnification sets belonging to Lethic 
and Rustock (note that we experienced a downtime of 
the system during November 2010). The figures show 
that dynamic IP addresses steadily grow over time, while 
static IP addresses reach saturation after some time. Fur- 
thermore, it is interesting to notice that we did not ob- 
serve much Rustock activity between December 24, 2010 
and January 10, 2011. Several sources reported that the 
botnet was (almost) down during this period [14, 37]. 
BOTMAGNIFIER confirms this downtime of the botnet, 
which indicates that our approach can effectively track 
the activity of botnets. After the botnet went back up 
again in January 2011, we observed a steady growth in 
the number of Rustock IP addresses detected by 
BOTMAGNIFIER. 

Botnet  | Total # of IP addresses | # of dynamic IP addresses | # of static IP addresses | # of events per bot (per day) 
Lethic  | 887,852                 | 770,517                   | 117,335                  | 101 
Rustock | 676,905                 | 572,445                   | 104,460                  | 173 
Cutwail | 319,355                 | 285,223                   | 34,132                   | 208 
MegaD   | 68,117                  | 65,062                    | 3,055                    | 112 
Waledac | 36,058                  | 32,602                    | 3,450                    | 140 
Table 1: Overview of the BOTMAGNIFIER results 

Figure 2: Growth of the dynamic and static IP address populations for the two major botnets 

Figures 3(a) and 3(b) show the cumulative distribu- 
tion functions of dynamic IP addresses and static IP ad- 
dresses tracked during our experiment for the five largest 
botnets. It is interesting to see that we started observing 
campaigns carried out by Waledac on January 1, 2011. 
This is consistent with the reports from several sources, 
who also noticed that a new botnet appeared at the same 
time [17, 31]. We also observed minimal spam activity 
associated with MegaD after December 7, 2010. This 
was a few days after the botmaster was arrested [30]. 


6.3 Application of Results 


False positives. In Section 4, we showed how the 
parameter k is chosen to balance true positives against 
false positives. We initially tolerated a small number of 
false positives because these do not affect the big picture 
of tracking large botnet populations. However, we want 
to quantify the false positive rate of the results, i.e., how 
many of the bot candidates are actually legitimate ma- 
chines. This information is important, especially if BOT- 
MAGNIFIER is used to inform Internet Service Providers 
or other organizations about infected machines. Furthermore, 
if we want to use the results to improve spam filtering 
systems, we need to be very careful about which 
IP addresses we consider as bots. We use the same tech- 
niques outlined in Section 6.1 to check for false posi- 
tives. We remove each IP address that matches any of 
these techniques from the magnified sets. 

We ran this false positive detection heuristic on all the 
magnified IP addresses identified during the evaluation 
period. This resulted in 35,680 (~1.6% of the total) IP 
addresses marked as potential false positives. While this 
might sound high at first, we also need to evaluate how 
relevant this false positive rate is in practice: our results 
can be used to augment existing systems and thus we can 
tolerate a certain rate of false positives. In addition, when 
deploying BOTMAGNIFIER in a production system, one 
could add a filter that applies the techniques from 
Section 6.1 to any magnified pool, and obtain clean results 
that can be used for spam reduction. 


Figure 3: Cumulative Distribution Function for the bot populations grown by BOTMAGNIFIER 


Improving existing blacklists. We wanted to understand 
whether our approach can improve existing blacklists 
by providing information about spamming bots that 
are currently active. To achieve this, we analyzed the 
email logs from the UCSB computer science department 
over a period of two months, from November 30, 2010 
to February 8, 2011. As a first step, the department mail 
server uses Spamhaus as a pre-filtering mechanism, and 
therefore the majority of the spam gets blocked before 
being processed. For each email whose sender is not 
blacklisted, the server runs SpamAssassin [3] for content 
analysis, to find out if the message content is suspicious. 
SpamAssassin assigns a spam score to each message, and 
the server flags it as spam or ham according to that value. 
These two steps are useful to evaluate how BOTMAGNI- 
FIER performs, for the following reasons: 


• If a mail reached the server during a certain day, it 
means that at that time its sender was not blacklisted 
by Spamhaus. 

• The spam ratios computed by SpamAssassin provide 
a method for the evaluation of BOTMAGNIFIER's 
false positives. 


During the analysis period, the department mail server 
logged 327,706 emails in total, sent by 228,297 distinct 
IP addresses. Of these, 28,563 emails were considered 
as spam by SpamAssassin, i.e., they bypassed the first 
filtering step based on Spamhaus. These mails had been 
sent by 10,284 IP addresses. We compared these IP 
addresses with the magnified sets obtained by 
BOTMAGNIFIER during the same period: 1,102 (≈ 10.8%) 
appeared in the magnified sets. We then evaluated how 
many of these IP addresses would have been detected 
before reaching the server if our tool had been used 
in parallel with the DNSBL system. To do this, we 
analyzed how many of the spam sender IP addresses were 
detected by BOTMAGNIFIER before they sent spam to 
our server. We found 295 IP addresses showing this be- 
havior. All together, these hosts sent 1,225 emails, which 
accounted for 4% of the total spam received by the server 
during this time. 
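
The comparison described above (which spam senders were already in a magnified set before their first spam email reached the server) can be sketched as follows. The data layout, the function name `detected_before_spam`, and the sample addresses are assumptions made for illustration; the text does not describe the log format that was actually used.

```python
from datetime import datetime

def detected_before_spam(spam_log, first_detected):
    """Return {ip: spam_email_count} for senders flagged before their first spam email.

    spam_log: iterable of (ip, timestamp) pairs, one per spam email received.
    first_detected: dict mapping ip -> time it first appeared in a magnified set.
    """
    first_spam = {}
    emails = {}
    for ip, ts in spam_log:
        first_spam[ip] = min(ts, first_spam.get(ip, ts))
        emails[ip] = emails.get(ip, 0) + 1
    return {
        ip: emails[ip]
        for ip, ts in first_spam.items()
        if ip in first_detected and first_detected[ip] < ts
    }

# Hypothetical sample data.
spam_log = [
    ("192.0.2.1", datetime(2011, 1, 5)),
    ("192.0.2.1", datetime(2011, 1, 6)),
    ("192.0.2.2", datetime(2011, 1, 5)),
    ("192.0.2.3", datetime(2011, 1, 7)),
]
first_detected = {
    "192.0.2.1": datetime(2011, 1, 3),   # flagged before it spammed the server
    "192.0.2.2": datetime(2011, 1, 9),   # flagged only afterwards
}

early = detected_before_spam(spam_log, first_detected)
blocked_share = sum(early.values()) / len(spam_log)
```

In this sketch, `len(early)` plays the role of the 295 early-detected IP addresses and `blocked_share` that of the 4% of spam that could have been blocked.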

We then wanted to quantify the false positives in the 
magnified pools generated by BOTMAGNIFIER. To do 
this, we first searched for those IP addresses that were 
in one of the magnification pools, but whose messages 
SpamAssassin had classified as ham. This resulted in 28 
matches. Of these, 15 were blacklisted by Spamhaus 
when we ran the tests, and therefore we assume they are 
false negatives by SpamAssassin. Of the remaining 13 
hosts, 12 were detected as legitimate servers by the fil- 
ters described in Section 6.1. For the remaining one IP 
address, we found evidence of it being associated with 
spamming behavior on another blacklist [23]. We there- 
fore consider it as a false negative by SpamAssassin as 
well. 

In summary, we conclude that BOTMAGNIFIER can 
be used to improve the spam filtering on the department 
email server: the server would have received 4% 
fewer spam emails, and no legitimate emails would have 
been dropped by mistake within these two months. 
Having access to more Spamhaus mirrors would allow us to 
increase this percentage. 


Resilience to evasion. If the techniques introduced by 
BOTMAGNIFIER become popular, spammers will mod- 
ify their behavior to evade detection. In this section, we 
discuss how we could react to such evasion attempts. 

The first method that could be used against our system 
is obfuscating the email subject lines, to prevent 
BOTMAGNIFIER from creating the seed pools. If this were the 
case, we could leverage previous work [22, 40] that takes 
into account the body of emails to identify emails that are 
sent by the same botnet. As an alternative, we could use 
different methods to build the seed pools, such as clus- 
tering bots based on the IPs of the C&C servers that they 
contact. 

Another evasion approach spammers might try is to re- 
duce the number of bots associated with each campaign. 
The goal would be to stay under the threshold required 
by BOTMAGNIFIER (i.e., 1,000) to work. This would re- 
quire more management effort on the botmaster’s side, 
since more campaigns would need to be run. Moreover, 
we could use other techniques to cluster the spam 
campaigns. For example, it is unlikely that the spammers 
would set up a different website for each of the small 
campaigns they create. We could then cluster the 
campaigns by looking at the web sites the URLs in the spam 
emails point to. 
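
As a rough illustration of this fallback, the sketch below groups campaigns by the host names of the URLs they advertise. It is a hypothetical example, not part of BOTMAGNIFIER: the campaign identifiers, the URLs, and the function name `cluster_by_site` are invented, and a real deployment would also have to deal with URL shorteners and redirects.

```python
from collections import defaultdict
from urllib.parse import urlparse

def cluster_by_site(campaigns):
    """Group campaign identifiers by the host names their advertised URLs point to."""
    clusters = defaultdict(set)
    for campaign_id, urls in campaigns.items():
        for url in urls:
            host = urlparse(url).hostname  # lower-cased host name, or None
            if host:
                clusters[host].add(campaign_id)
    return dict(clusters)

# Hypothetical small campaigns, keyed by an arbitrary identifier.
campaigns = {
    "subject-A": ["http://pills.example/buy?id=1"],
    "subject-B": ["http://pills.example/buy?id=2"],
    "subject-C": ["http://watches.example/"],
}
clusters = cluster_by_site(campaigns)
```

Campaigns that end up in the same cluster could then be merged, so that their combined bot population again exceeds the size threshold.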

Other evasion techniques might be to assign a single 
domain to each spamming bot, or to avoid evenly dis- 
tributing email lists among bots. In the first case, BOT- 
MAGNIFIER would not be able to unequivocally identify 
a bot as being part of a specific botnet. However, the at- 
tribution requirement could be dropped, and these bots 
would still be detected as generic spamming bots. The 
second case would be successful in evading our current 
system. However, this behavior involves something that 
spammers want to avoid: having the same bot sending 
thousands of emails to the same domain within a short 
amount of time would most likely result in the bot being 
quickly blacklisted. 


6.4 Universality of k 


In Section 4, we introduced a function to determine the 
optimal N value according to the size of the seed pool’s 
target |T(p_i)|. To do this, we analyzed the data from two 
C&C servers of the Cutwail botnet. One could argue that 
this parameter will work well only for campaigns carried 
out by that botnet. To demonstrate that the value of k 
(and subsequently of N) estimated by the function 
produces good results for campaigns carried out by other 
botnets, we ran the same precision versus recall tech- 
nique we used in Section 4 on other datasets. Specifi- 
cally, we analyzed 600 campaigns observed in the wild, 
that had been carried out by the other botnets we stud- 
ied (Lethic, Rustock, Waledac, and MegaD). Since we 
did not have access to full ground truth for these cam- 
paigns, we used the IP addresses from the seed pools as 
true positives, and the set of IP addresses not blacklisted 
by Spamhaus as false positives. For the purpose of this 
analysis, we ignored any other IP address returned by 
the magnification process (i.e., magnified IP addresses 
already blacklisted by Spamhaus). 

The results are shown in Figure 4. The figure shows 
the function plot of k in relation to the size of |T(p_i)|. 
The dots show, for each campaign we analyzed, where 
the optimal value of k lies. As can be seen, the 
function of k we used approximates the optimal values for 
most campaigns well. This technique for setting k might 
also be used to set up BOTMAGNIFIER in the wild, when 
ground truth is not available. 


Figure 4: Analysis of our function for k compared to the 
optimal value of k for 600 campaigns 

Figure 5: Precision vs. Recall functions for five 
campaigns observed in the netflow dataset 


Data stream independence. In Section 2.2, we 
claimed that BOTMAGNIFIER can work with any kind of 
transaction log as long as this dataset provides 
information about which IP addresses sent email to which 
destination email servers at a given point in time. To 
confirm this claim, we ran BOTMAGNIFIER on an 
alternative dataset, extracted from netflow records [7] collected 
by the routers of a large Internet service provider. The 
netflow data is collected with a sampling rate of 1 out 
of 1,000. To extract the data in a format 
BOTMAGNIFIER understands, we extracted each connection directed 
to TCP port 25, and considered the timestamp at which 
the connection was initiated as the time the email was sent. 
On average, this transaction log contains 1.9 million en- 
tries per day related to about 194,000 unique sources. 
To run BOTMAGNIFIER on this dataset, we first need 
to correctly dimension k. As explained in Section 4, the 
equation for k is stable for any transaction log. 
However, the values of the constants k_b and α change for 
each dataset. To correctly dimension these parameters, 
we ran BOTMAGNIFIER on several campaigns extracted 
from the netflow records. The PR(k) analysis is shown 
in Figure 5. The optimal point of the campaigns is 
located at a lower k for this dataset compared to the ones 
analyzed in Section 4. To address this difference, we 
set k_b to 0.00008 and α to 1 when dealing with netflow 
records as transaction logs. After setting these 
parameters, we analyzed one week of data with 
BOTMAGNIFIER. The analysis period was between January 20 and 
January 28, 2011. During this period, we tracked 94,894 
bots. Of these, 36,739 (≈ 38.7%) belonged to the 
magnified sets of the observed campaigns. In particular, we 
observed 40,773 Rustock bots, 20,778 Lethic bots, 6,045 
Waledac bots, and 1,793 Cutwail bots. 
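
The conversion from flow records to a transaction log described above amounts to a simple filter. The sketch below assumes a hypothetical `Flow` record with the fields shown; real netflow exports differ in field names and encoding, so this is an illustration of the filtering rule rather than the parser that was actually used.

```python
from typing import Iterable, Iterator, NamedTuple, Tuple

class Flow(NamedTuple):
    """Hypothetical, simplified flow record."""
    start: float      # time the connection was initiated (Unix seconds)
    src_ip: str
    dst_ip: str
    dst_port: int
    protocol: str     # "tcp", "udp", ...

def smtp_transactions(flows: Iterable[Flow]) -> Iterator[Tuple[float, str, str]]:
    """Yield (time, sender IP, destination mail server IP) for flows to TCP port 25."""
    for flow in flows:
        if flow.protocol == "tcp" and flow.dst_port == 25:
            # The connection start time stands in for the time the email was sent.
            yield (flow.start, flow.src_ip, flow.dst_ip)

# Hypothetical sampled flows.
flows = [
    Flow(1295481600.0, "198.51.100.7", "203.0.113.25", 25, "tcp"),
    Flow(1295481601.0, "198.51.100.7", "203.0.113.80", 80, "tcp"),
    Flow(1295481602.0, "192.0.2.44", "203.0.113.25", 25, "udp"),
]
transactions = list(smtp_transactions(flows))
```

Each resulting tuple carries the same information as a DNSBL query log entry: who sent email, to which server, and when.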


7 Related Work 


Spam is one of the major problems on the Internet, and as 
a result, has attracted a considerable amount of research. 
In this section, we briefly review related work in this area 
and discuss the novel aspects of BOTMAGNIFIER. 


Botnet Tracking. A popular method to gain deeper in- 
sights into a particular botnet is botnet tracking, i.e., an 
attempt to learn more about a given botnet by analyzing 
its inner workings in detail [1,8]. There are several ap- 
proaches to conduct the actual analysis, for example by 
taking over the C&C infrastructure and then performing 
a live analysis [26,33]. An orthogonal approach is to take 
down the C&C server and perform an offline analysis of 
the server to reconstruct information [21]. A less inva- 
sive approach is to (automatically) reverse-engineer the 
communication protocol used by the botnet and then im- 
personate a bot [4, 6, 15,32]. This enables a continuous 
collection of information about the given botnet, e.g., to 
gather the spam templates used by the bots [6]. 

BOTMAGNIFIER complements these approaches: we 
are able to track spamming botnets on the Internet in a 
non-invasive way from a novel vantage point. The in- 
formation generated by our tool enables us to perform a 
high-level study of botnets. For example, we can track 
their size and evolution over time, and obtain a live view 
of hosts that belong to a particular botnet. 

Ramachandran et al. also analyzed queries against a 
DNSBL to reveal botnet memberships [27], but their 
motivation is completely different from ours: the intu- 
ition behind their approach is that bots might check if 
their own IP address is blacklisted by a given DNSBL. 
Such queries can be detected, which discloses informa- 
tion about infected machines. BOTMAGNIFIER is com- 
plementary with respect to this approach because it an- 
alyzes intrinsic traces left by spamming machines (i.e., 
an email server will query the DNSBL for information), 
and clustering and enriching this data enables us to find 
spambots in a generic way. Furthermore, we demon- 
strated that our approach can also be used on other kinds 
of transaction logs. 


Spam Studies. Several studies analyzed spam and the 
side-effects of this business [2, 12, 16, 42, 43]. 
BOTLAB [11], a tool to correlate incoming spam mails with 
outgoing spam collected by executing known bots in an 
analysis environment, shares some characteristics with 
our approach. The analysis results of BOTLAB can ap- 
proximate the relative size of different spamming bot- 
nets and provide insights into current spam campaigns 
based on the information collected at the site running the 
tool. In contrast, BOTMAGNIFIER enables us to detect 
IP addresses of hosts that belong to spamming botnets 
at an Internet-wide level. We use the analysis environ- 
ment only to collect information that enables us to as- 
sign labels to spam campaigns, while all other analysis 
techniques (e.g., the DNSBL analysis) are different com- 
pared to BOTLAB. 

Another system that shares some similarities with our 
approach is AUTORE [40], which examines content- 
level features in the email body such as URLs to group 
spam messages into campaigns. The authors performed a 
large-scale evaluation based on mail messages collected 
by a large webmail provider to generate signatures to de- 
tect polymorphic modifications for individual spam cam- 
paigns. Xie et al. also examined characteristics of the 
spam campaigns, similar to our work. In contrast, our 
approach focuses primarily on the behavioral similarities 
between members of a spamming botnet, without requir- 
ing knowledge of the actual spam content. 


Spam Mitigation. The typical approaches to detect 
spam either focus on the content of spam messages [3, 
22,40] or on the analysis of network-level features [10, 
25, 26, 28, 29,38]. BOTMAGNIFIER generates lists of IP 
addresses that belong to spamming botnets, which com- 
plements both kinds of approaches: the analysis results 
can be used to improve systems that use network-level 
features to detect spambots, e.g., by proactively listing 
such IP addresses in blacklists, or complement existing 
systems, as demonstrated in Section 6.3. Furthermore, 
the information can be used to notify ISPs about infected 
customers within their networks. 


8 Conclusion 


We presented BOTMAGNIFIER, a tool for tracking and 
analyzing spamming botnets. The tool is able to “mag- 
nify” an initial seed pool of spamming IP addresses 
by learning the behavior of known spamming bots and 
matching the learned patterns against a (partial) log of 
the email transactions carried out on the Internet. We 
have validated and evaluated our approach on a number 
of datasets (including the ground truth data from a bot- 
net’s C&C hosts), showing that BOTMAGNIFIER is in- 
deed able to accurately identify and track botnets. 
Future work will focus either on finding new data inputs 
that can populate our initial seed pools or on obtaining 
a different, more comprehensive transaction log to be 
able to identify spamming bots more comprehensively. 
Also, analyzing larger data streams might allow us to 
apply more features for our magnification process, 
producing more complete results. 
Abstract 


A distinguishing characteristic of bots is their ability 
to establish a command and control (C&C) channel. The 
typical approach to build detection models for C&C traf- 
fic and to identify C&C endpoints (IP addresses and do- 
mains of C&C servers) is to execute a bot in a controlled 
environment and monitor its outgoing network connec- 
tions. Using the bot traffic, one can then craft signa- 
tures that match C&C connections or blacklist the IP 
addresses or domains that the packets are sent to. Un- 
fortunately, this process is not as easy as it seems. For 
example, bots often open a large number of additional 
connections to legitimate sites (to perform click fraud 
or query for the current time), and bots can deliberately 
produce “noise” — bogus connections that make the anal- 
ysis more difficult. Thus, before one can build a model 
for C&C traffic or blacklist IP addresses and domains, 
one first has to pick the C&C connections among all the 
network traffic that a bot produces. 

In this paper, we present JACKSTRAWS, a system that 
accurately identifies C&C connections. To this end, we 
leverage host-based information that provides insights 
into which data is sent over each network connection as 
well as the ways in which a bot processes the informa- 
tion that it receives. More precisely, we associate with 
each network connection a behavior graph that captures 
the system calls that lead to this connection, as well as 
the system calls that operate on data that is returned. 
By using machine learning techniques and a training 
set of graphs that are associated with known C&C con- 
nections, we automatically extract and generalize graph 
templates that capture the core of different types of C&C 
activity. Later, we use these C&C templates to match 
against behavior graphs produced by other bots. Our 
results show that JACKSTRAWS can accurately detect 
C&C connections, even for novel bot families that were 
not used for template generation. 


1 Introduction 


Malware is a significant threat and root cause for many 
security problems on the Internet, such as spam, 
distributed denial of service attacks, data theft, or click 
fraud. Arguably, the most common type of malware 
today is the bot. Compared to other types of malware, 
the distinguishing characteristic of bots is their abil- 
ity to establish a command and control (C&C) channel 
that allows an attacker to remotely control and update a 
compromised machine. A number of bot-infected ma- 
chines that are combined under the control of a single 
entity (called the botmaster) are referred to as a bot- 
net [7,8, 14,37]. 

Researchers and security vendors have proposed 
many different host-based or network-based techniques 
to detect and mitigate botnets. Host-based detectors 
treat bots like any other type of malware. These sys- 
tems (e.g., anti-virus tools) use signatures to scan pro- 
grams for the presence of well-known, malicious pat- 
terns [43], or they monitor operating system processes 
for suspicious activity [26]. Unfortunately, current tools 
suffer from low detection rates [4], and they often in- 
cur a non-negligible performance penalty on end users’ 
machines. To complement host-based techniques, re- 
searchers have explored network-based detection ap- 
proaches [15-18, 34, 41,45, 49]. Leveraging the insight 
that bots need to communicate with their command and 
control infrastructure, most network-based botnet detec- 
tors focus on identifying C&C communications. 


Initially, models that match command and control 
traffic were built manually [15,17]. To improve and 
accelerate this slow and tedious process, researchers 
proposed automated model (signature) generation tech- 
niques [34,45]. These techniques share a similar work 
flow (a work flow that, interestingly, was already used 
in previous systems to extract signatures for spreading 
worms [25,27,29,31,39]): First, one has to collect traces 
of malicious traffic, typically by running bot samples 
in a controlled environment. Second, these traces are 
checked for strings (or token sequences) that appear fre- 
quently, and can thus be transformed into signatures. 


While previous systems have demonstrated some suc- 
cess with the automated generation of C&C detectors 
based on malicious network traces, they suffer from 
three significant shortcomings: The first problem is that 
bots do not only connect to their C&C infrastructure, but 
frequently open many additional connections. Some of 
the additional connections are used to carry out mali- 
cious activity (e.g., scanning potential victims, sending 
spam, or click fraud). However, in other cases, the traffic 
is not malicious per se. For example, consider a bot that 
connects to a popular site to check the Internet connec- 
tivity, or a bot that attempts to obtain the current time or 
its external IP address (e.g., local system settings are un- 
der the control of researchers who might try to trick mal- 
ware and trigger certain behaviors; they are thus unreli- 
able from the bot perspective [19,35]). In most of these 
cases, the malware traffic is basically identical to traffic 
produced by a legitimate client. Of course, one can use 
simple rules to discard some of the traffic (scans, spam), 
but other connections are much harder to filter; e.g., how 
can one distinguish an HTTP-based C&C request from a 
request for an item on a web site? Thus, there is a 
significant risk that automated systems produce models that 
capture legitimate traffic. Unfortunately, a filtering step 
can remove such models only to a certain extent. 


To highlight the difficulty of finding C&C connections 
in bot traffic, we report on the analysis of a database that 
was given to us by a security company. This database 
contains network traffic produced by malware samples 
run in a dynamic analysis environment. Over a period 
of two months (Sept./Oct. 2010), this company ana- 
lyzed 153,991 malware samples that produced a total 
of 593,012 connections, after removing all empty and 
scan-related traffic. A significant majority (87.9%) of 
this traffic was HTTP, followed by mail traffic (3.8%) 
and small amounts of a wide variety of other protocols 
(including IRC). The company used two sets of signa- 
tures to analyze their traffic: One set matches known 
C&C traffic, the other set matches traffic that is known 
to be harmless. This second set is used to quickly discard 
from further analysis connections that are known to be 
unrelated to any C&C activity. Such connections include 
accesses to ad networks, search engines, or games sites. 
Using these two signature sets, we found 109,600 mali- 
cious C&C connections (18.5%), but also 69,211 benign 
connections (11.7%). The remaining 414,201 connec- 
tions (69.8%) were unknown; they did not match any 
signature, and thus, likely consist of a mix of malicious 
and harmless traffic. This demonstrates that it is chal- 
lenging to distinguish between harmless web requests 
and HTTP-based C&C connections. 
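
The two-signature-set triage described above can be sketched as follows. The substring signatures, the order in which the sets are checked, and the sample payloads are all assumptions for illustration; the company's actual signature format is not described in the text. The second half of the block recomputes the reported percentages from the reported counts.

```python
from collections import Counter

def triage(connections, cnc_signatures, benign_signatures):
    """Label each connection payload as 'cnc', 'benign', or 'unknown'."""
    labels = Counter()
    for payload in connections:
        if any(sig in payload for sig in cnc_signatures):
            labels["cnc"] += 1
        elif any(sig in payload for sig in benign_signatures):
            labels["benign"] += 1
        else:
            labels["unknown"] += 1
    return labels

# Hypothetical payloads and substring signatures.
sample = triage(
    ["GET /gate.php?id=42", "GET /ads/banner.js", "GET /index.html"],
    cnc_signatures=["/gate.php"],
    benign_signatures=["/ads/"],
)

# Counts reported in the text for the two-month dataset.
reported = {"cnc": 109_600, "benign": 69_211, "unknown": 414_201}
total = sum(reported.values())
shares = {label: round(100 * n / total, 1) for label, n in reported.items()}
```

The `unknown` bucket is the one a system such as JACKSTRAWS is meant to shrink.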


The second problem with existing techniques is that 
attackers can confuse automated model (signature) 
generation systems: previous research has presented “noise 
injection” attacks in which malware crafts additional 
connections with the sole purpose of thwarting signature 
extraction techniques [10, 11, 33]. A real-world 
example of such behavior can be found in the Pushdo 
malware family, where bots, in certain versions, create junk 
SSL connections to more than 300 different web sites to 
blend in with benign traffic [1]. 

The third problem is that existing techniques do not 
work when the C&C traffic is encrypted. Clearly, it is 
not possible to extract a content signature to model en- 
crypted traffic. However, even when the traffic is en- 
crypted, it would be desirable to add the C&C server 
destinations to a blacklist or to model alternative net- 
work properties that are not content-based. For this, it is 
necessary to identify those encrypted malware connec- 
tions that go to the C&C infrastructure and distinguish 
them from unrelated but possibly encrypted traffic, such 
as legitimate, SSL-encrypted web traffic. 

The root cause for the three shortcomings is that ex- 
isting approaches extract models directly from network 
traces. Moreover, they do so at a purely syntactic level. 
That is, model generation systems simply select ele- 
ments that occur frequently in the analyzed network traf- 
fic. Unfortunately, they lack “understanding” of the 
purpose of different network connections. As a result, 
such systems often generate models that match irrele- 
vant, non-C&C traffic, and they incorrectly consider de- 
coy connections. Moreover, in the case of encrypted 
traffic, no frequent element can be found at all. 

To solve the aforementioned problems, we propose an 
approach to detect the network connections that a mal- 
ware program uses for command and control, and to dis- 
tinguish these connections from other, unrelated traffic. 
This allows us to immediately consider the destination 
hosts/domains for inclusion in a blacklist, even when the 
corresponding connections are encrypted. Moreover, we 
can feed signature generation systems with only C&C 
traffic, discarding irrelevant connections and making it 
much more difficult for the attacker to inject noise. 

We leverage the key observation that we can use host- 
based information to learn more about the semantics of 
network connections. More precisely, we monitor the 
execution of a malware process while it communicates 
over the network. This allows us to determine, for each 
request, which data is sent over the network and where 
this data comes from. Moreover, we can determine how 
the program uses data that it receives over the network. 
Using this information, we can build models that cap- 
ture the host-based activity associated with individual 
network connections. Our models are behavior graphs, 
where the nodes are system calls and the edges represent 
data flows between system calls. 

We use machine-learning to build graph-based models 
that characterize malicious C&C connections (e.g., con- 
nections that download binary updates that the malware 
later executes, or connections in which the malware 
uploads stolen data to a C&C server). More precisely, 
starting from labeled sets of graphs that are related to both 
known C&C connections and other, irrelevant malware 
traffic, we identify those subgraphs that are most char- 
acteristic of C&C communication. In the next step, we 
abstract from these specific subgraphs and produce gen- 
eralized graph templates. Each graph template captures 
the core characteristics of a different type or implemen- 
tation of C&C communication. These graph templates 
can be used to recognize C&C connections of bots that 
have not been analyzed previously. Moreover, our tem- 
plates possess explanatory capabilities and can help ana- 
lysts to understand how a particular bot utilizes its C&C 
channel (e.g., for binary updates, configuration files, or 
information leakage). 

Our experiments demonstrate that our system can 
generate C&C templates that recognize host-based ac- 
tivity associated with known, malicious traffic with high 
accuracy and very few false positives. Moreover, we 
show that our templates also generalize; that is, they de- 
tect C&C connections that were previously unknown. 
The contributions of this paper are the following: 


• We present a novel approach to identify C&C 
communication in the large pool of network connections 
that modern bots open. Our approach leverages 
host-based information and associates models, 
which are based on system call graphs, with the 
data that is exchanged over network connections. 

• We present a novel technique that generalizes 
system call graphs to capture the “essence” of, or 
the core activities related to, C&C communication. 
This generalization step extends previous work on 
system call graphs, and provides interesting 
insights into the purpose of C&C traffic. 

• We implemented these techniques in a tool called 
JACKSTRAWS and evaluated it on 130,635 connections 
produced by more than 37 thousand malware 
samples. Our results show that the generated 
templates detect known C&C traffic with high accuracy, 
and less than 0.2% false positives over harmless 
traffic. Moreover, we found 9,464 previously 
unknown C&C connections, improving the coverage 
of hand-crafted network signatures by 60%. 


2 System Overview 


Our system monitors the execution of a malware 
program in a dynamic malware analysis environment (such 
as Anubis [20], BitBlaze [40], CWSandbox [44], or 
Ether [9]). The goal is to identify those network 
connections that are used for C&C communication. To this 
end, we record the activities (in our case, system calls) 
on the host that are related to data that is sent over and 
received through each network connection. These 
activities are modeled as behavior graphs, which are graphs 
that capture system call invocations and data flows 
between system calls. In our setting, one graph is 
associated with each connection. As the next step, all 
behavior graphs that are created during the execution of a 
malware sample are matched against templates that 
represent different types of C&C communication. When a 
graph matches a template sufficiently closely, the 
corresponding connection is reported as a C&C channel. 

Figure 1: Example of behavior graph that shows 
information leakage. Underneath, the network log shows that 
the Windows ID was leaked via the GET parameter id. 

In the following paragraphs, we first discuss behavior 
graphs. We then provide an overview of the necessary 
steps to generate the C&C templates. 


Behavior graphs. A behavior graph G is a graph where 
nodes represent system calls. A directed edge e is in- 
troduced from node x to node y when the system call 
associated with y uses as argument some output that is 
produced by system call x. That is, an edge represents a 
data dependency between system calls x and y. Behav- 
ior graphs have been introduced in previous work as a 
suitable mechanism to model the host-based activity of 
(malware) programs [5,13,26]. The reason is that system 
calls capture the interactions of a program with its envi- 
ronment (e.g., the operating system or the network), and 
data flows represent a natural dependence and ordered 
relationship between two system calls where the output 
of one call is directly used as the input to the other one. 
Figure 1 shows an example of a behavior graph. This 
graph captures the host-based activity of a bot that reads 
the Windows serial number (ID) from the registry and 
sends it to its command and control server. Frequently, 
bots collect a wealth of information about the infected, 
local system, and they send this information to their 
C&C servers. The graph shows the system calls that are 
invoked to open and read the Windows ID key from the 
registry. Then, the key is sent over a network connection 
(that was previously opened with connect). An 
answer is finally received from the server (recv node). 
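
As a rough illustration of this data structure (a minimal sketch, not the JACKSTRAWS implementation), a behavior graph can be represented as a labeled directed graph. The class name, the concrete system call names for the registry access, and the edge labels below are assumptions chosen to mirror the example in the text.

```python
from dataclasses import dataclass, field


@dataclass
class BehaviorGraph:
    """Directed graph: nodes are system calls, edges are data dependencies."""
    nodes: dict = field(default_factory=dict)  # node id -> system call name
    edges: set = field(default_factory=set)    # (producer, consumer, label)

    def add_call(self, node_id, syscall):
        self.nodes[node_id] = syscall

    def add_dependency(self, producer, consumer, label):
        # An edge x -> y means: y uses as argument some output of x.
        if producer not in self.nodes or consumer not in self.nodes:
            raise KeyError("both endpoints must be added first")
        self.edges.add((producer, consumer, label))

    def parents(self, node_id):
        return {p for (p, c, _) in self.edges if c == node_id}


def registry_leak_example():
    # Hypothetical call names and labels mirroring the example in the text.
    g = BehaviorGraph()
    calls = ["NtOpenKey", "NtQueryValueKey", "connect", "send", "recv"]
    for node_id, name in enumerate(calls):
        g.add_call(node_id, name)
    g.add_dependency(0, 1, "key handle")  # open key -> read value
    g.add_dependency(1, 3, "buffer")      # registry value -> sent data
    g.add_dependency(2, 3, "socket")      # connect -> send
    g.add_dependency(2, 4, "socket")      # connect -> recv
    return g
```

In this sketch, the send node has two parents: the registry read that produced the leaked value and the connect call that produced the socket.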

While behavior graphs are not novel per se, we use 
them in a different context to solve a novel problem. In 
previous work, behavior graphs were used to distinguish 
between malicious and benign program executions. In 
this work, we link behavior graphs to network traffic and 
combine these two views. That is, we use these graphs 
to identify command and control communication amidst 
all connections that are produced by a malware sample. 


C&C templates. As mentioned previously, the behav- 
ior graphs that are produced by our dynamic malware 
analysis system are matched against a set of C&C tem- 
plates. C&C templates share many similarities with be- 
havior graphs. In particular, nodes n carry information 
about system call names and arguments encoded as 
labels $l_n$, and edges e represent data dependencies where 
the type of flow is encoded as labels $l_e$. The main 
difference to behavior graphs is that the nodes of templates are 
divided into two classes: core and optional nodes. Core 
nodes capture the necessary parts of a malicious activity, 
while optional nodes are only sometimes present. 

To match a C&C template against a behavior graph G, 
we define a similarity function δ. This function takes as 
input the behavior graph G and a C&C template T and 
produces a score that indicates how well G matches the 
template. At a minimum, all core nodes of a template must 
be present in G in order to declare a match. 


Template generation. Each C&C template represents 
a certain type of command and control activity. We use 
the following four steps to generate C&C templates: 

In the first step, we run malware executables in our 
dynamic malware analysis environment, and extract the 
behavior graphs for their network connections. These 
connections can be benign or related to C&C traffic. 

JACKSTRAWS requires that some of these connections 
are labeled as either malicious or benign (for training). 
In our current system, we apply a set of signatures to 
all connections to find (i) known C&C communication 
and (ii) traffic that is known to be unrelated to C&C. 
Note that we have signatures that explicitly identify be- 
nign connections as such. The signatures were manually 
constructed, and they were given to us by a network se- 
curity company. By matching the signatures against the 
network traffic, we find a set of behavior graphs that are 
associated with known C&C connections (called mali- 
cious graph set) and a set of behavior graphs associated 
with non-C&C traffic (called benign graph set). These 
sets serve as the basis for the subsequent steps. 

It is important to observe that our general approach 
only requires labeled connections, without considering 
the payload of network connections. Thus, we could use 
other means to generate the two graph sets. For exam- 
ple, we can add a graph to the malicious set if the network 
connection corresponding to this graph contacted 
a known blacklisted C&C domain. This allows us to 
create suitable graph sets even for encrypted C&C con- 
nections. One could also manually label connections. 

Of course, there are also graphs for which we do not 
have a classification (that is, neither a C&C signature nor 
a benign signature has matched). These unknown graphs 
could be related to either malicious or benign traffic, and 
we do not consider them in the subsequent steps. 

The second step uses the malicious and the benign 
graph sets as inputs and performs graph mining. More 
precisely, we use a graph mining technique, previously 
presented by Yan and Han [47,48], to identify subgraphs 
that frequently appear in the malicious graph set. These 
frequent subgraphs are likely to constitute the core activ- 
ity linked to C&C connections. Some post-processing is 
then applied to compact the set of mined subgraphs. Fi- 
nally, the set difference is computed between the mined, 
malicious subgraphs and the benign graph set. Only 
subgraphs that never appear by subgraph isomorphism 
in the benign graph set are selected. The assumption 
is that the selected subgraphs represent some host- and 
network-level activity that is only characteristic of par- 
ticular C&C connections, but not benign traffic. 

In [13], the authors used a similar approach to distin- 
guish between malware and harmless programs. To this 
end, the authors used a leap mining technique presented 
by Yan et al. [46] that selects subgraphs which maximize 
the information gain between the malicious and benign 
graph sets, that is to say subgraphs that maximally cover 
(detect) the entire collection of malicious graphs while 
introducing a very low number of false positives. How- 
ever, during the mining process, this technique tends to 
remove the graph parts that could be common to both 
benign and malicious graphs. In our present case, these 
parts are critical to obtain complete C&C templates. For 
example, in the case of a download and execute com- 
mand, if the download part of the graph is observed in 
the benign set, leap mining would only mine the execute 
part. For these reasons, we performed the set difference 
with the benign graph set only as post-processing, once 
complete malicious subgraphs have already been mined, 
without risk of losing parts of them. 

In addition, the algorithm proposed in [13] does not 
attempt to synthesize any semantic information from the 
mined behaviors; it does not produce a template that 
combines related behaviors and generalizes their com- 
mon core. In other words [13], “this synthesis step does 
not add new behaviors to the set, it only combines the 
ones previously mined.” In this paper, we go further and 
introduce two additional, novel steps to generalize the 
results obtained during the graph mining step. This is 
important because we want to generalize from specific 
instances of implementing a C&C connection and abstract 
a core that characterizes the common and necessary 
operations for a particular type of command. 

As a third step, we cluster the graphs previously 
mined. The goal of this step is to group together graphs 
that correspond to a similar type of command and con- 
trol activity. That is, when we have observed differ- 
ent instances of one particular behavior, we combine 
the corresponding graphs into one cluster. As an ex- 
ample, consider different instances of a malware family 
where each sample downloads data from the network via 
HTTP, decodes it in some way, stores the data on disk, 
and finally executes that file. All instances of this behavior 
are examples of typical bot update mechanisms 
(download and execute), and we want to group all of 
them into one cluster. As a result of this step, we ob- 
tain different clusters, where each cluster contains a set 
of graphs that correspond to a particular C&C activity. 

In the fourth step, we produce a single C&C template 
for each cluster. The goal of a template is to capture the 
common core of the graphs in a cluster, with the assumption 
that this common core represents the key activities 
for a particular behavior. The C&C templates are gener- 
ated by iteratively computing the weighted minimal com- 
mon supergraph (WMCS) [3] between the graphs in a 
cluster. The nodes and edges in the supergraph that are 
present in all individual graphs become part of the core. 
The remaining ones become optional. 

At the end of this step, we have extracted templates 
that match the core of the program activities for different 
types of commands, taking into account optional opera- 
tions that are frequently (but not always) present. This 
allows us to match variants of C&C traffic that might be 
different (to a certain degree) from the exact graphs that 
we used to generate the C&C templates. 


3 System Details 


In this section, we provide an overview of the actual im- 
plementation of JACKSTRAWS and explain the different 
analysis steps in greater detail. 


3.1 Analysis Environment 


We use the dynamic malware analysis environment Anu- 
bis [20] as the basis for our implementation, and imple- 
mented several extensions according to our needs. Note 
that the general approach and the concepts outlined in 
this paper are independent of the actual analysis envi- 
ronment; we could have also used BitBlaze, Ether, or 
any other dynamic malware analysis environment. 

As discussed in Section 2, behavior graphs are used 
to capture and represent the host-based activity that mal- 
ware performs. To create such behavior graphs, we 
execute a malware sample and record the system calls 
that this sample invokes. In addition, we identify de- 
pendencies between different events of the execution 
by making use of dynamic taint analysis [38], a technique 
that allows us to assess whether a register or memory 
value depends on the output of a certain operation. 
Anubis already comes with taint propagation support. 
By default, all output arguments of system calls 
from the native Windows API (e.g., NtCreateFile, 
NtCreateProcess, etc.) are marked with a unique 
taint label. Anubis then propagates the taint information 
while the monitored system processes tainted data. Anu- 
bis also monitors if previously tainted data is used as an 
input argument for another system call. 

While Anubis propagates taint information for data in 
memory, it does not track taint information on the file 
system. In other words, if tainted data is written to a 
file and subsequently read back into memory, the origi- 
nal taint labels are not restored. This shortcoming turned 
out to be a significant drawback in our setting: For example, 
bots frequently download data from the C&C server, 
decode it in memory, write this data to a file, and later 
execute it. Without taint tracking through the file system, 
we cannot identify the dependency between the data that 
is downloaded and the file that is later executed. Another 
example is the use of configuration data: Many malware 
samples retrieve configuration settings from their C&C 
servers, such as URLs that should be monitored for sen- 
sitive data or address lists for spam purposes. Such con- 
figuration data is often written to a dedicated file before 
it is loaded and used later. Restoring the original taint labels 
when files are read ensures that the subsequent bot 
activity is linked to the initial network connection and 
improves the completeness of the behavior graphs. 

Finally, we improved the network logging abilities 
of Anubis by hooking directly into the Winsock API 
calls rather than considering only the abstract interface 
(NtDeviceIoControlFile) at the native system 
call level. This allows us to conveniently reconstruct 
the network flows, since send and receive operations are 
readily visible at the higher-level APIs. 


3.2 Behavior Graph Generation 


When the sample and all of its child processes have ter- 
minated, or after a fixed timeout (currently set to 4 min- 
utes), JACKSTRAWS saves all monitored system calls, 
network-related data, and tainting information into a log 
file. Unlike previous work that used behavior graphs 
for distinguishing between malicious and legitimate pro- 
grams, we use these graphs to determine the purpose of 
network connections (and to detect C&C traffic). Thus, 
we are not interested in the entire activity of the mal- 
ware program. Instead, we only focus on actions related 
to network traffic. To this end, we first identify all send 
and receive operations that operate on a successfully- 
established network connection. In this work, we fo- 
cus only on TCP traffic, and a connection is considered 
successful when the three-way handshake has completed 
and at least one byte of user data was exchanged. All 
system calls that are related to a single network con- 
nection are added to the behavior graph for this connec- 
tion. That is, for each network connection that a sample 
makes, we obtain one behavior graph which captures the 
host-based activities related to this connection. 

For each send operation, we check whether the sent 
data is tainted. If so, we add the corresponding system 
call that produced this data to the behavior graph and 
connect both nodes with an edge. Likewise, for each 
receive operation, we taint the received data and check 
if it is later used as input to a system call. If so, we also 
add this system call to the graph and connect the nodes. 

For each system call that is added to the graph in this 
fashion, we also check backward dependencies (that is, 
whether the system call has tainted input arguments). If 
this is the case, we continue to add the system call(s) 
that are responsible for this data. This process is re- 
peated recursively as long as there are system calls left 
that have tainted input arguments that are unaccounted 
for. That is, for every node that is added to our behav- 
ior graph, we will also add all parent nodes that produce 
data that this node consumes. For example, if received 
data is written to a local file, we will add the correspond- 
ing NtWriteFile system call to the graph. This write 
system call will use as one of its arguments a file han- 
dle. This file handle is likely tainted, because it was 
produced by a previous invocation of NtCreateFile. 
Thus, we also add the node that corresponds to this cre- 
ate system call and connect the two nodes with an edge. 
On the other hand, forward dependencies are not recur- 
sively followed to avoid an explosion in the graph size. 
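
The construction described above can be sketched as a worklist traversal. This is a simplified illustration under stated assumptions, not the authors' code: the function names are hypothetical, and the taint log is assumed to be already condensed into a `producers` map from each call to the calls whose tainted output it consumes.

```python
def connection_seeds(net_ops, recv_ops, producers):
    """Network operations plus the calls that directly consume received data."""
    recv_ops = set(recv_ops)
    direct_consumers = {c for c, ps in producers.items() if ps & recv_ops}
    return set(net_ops) | direct_consumers


def build_connection_graph(seeds, producers):
    """Collect the nodes and edges of the behavior graph for one connection.

    seeds:     ids of calls directly tied to the connection.
    producers: call id -> set of call ids whose tainted output it consumes.
    """
    nodes, edges = set(seeds), set()
    worklist = list(seeds)
    while worklist:
        call = worklist.pop()
        for parent in producers.get(call, ()):
            edges.add((parent, call))
            if parent not in nodes:  # follow backward dependencies only
                nodes.add(parent)
                worklist.append(parent)
    return nodes, edges
```

Because only backward dependencies are followed recursively, a call that merely consumes the output of a direct consumer is left out, which keeps the graph size bounded by the ancestry of the seed calls.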


Graph labeling. Nodes and edges that are inserted into 
the behavior graph are augmented with additional labels 
that capture more information about the nature of the 
system calls and the dependencies between nodes. For 
edges, the label stores either the names of the input or 
the output arguments of the system calls that are con- 
nected by a data dependency. For nodes, the label stores 
the system call name and some additional information 
that depends on the specific type of call. The additional 
information can store the type of the resource (files, reg- 
istry keys, ...) that a system call operates on as well as 
flags such as mode or permission bits. Note that some 
information is only stored as a comment; this information 
is ignored for the template generation and matching, but 
is saved for a human analyst who might want to examine 
a template. 

One important additional piece of information stored 
for system calls that manipulate files and registry keys is 
the name of these files and keys. However, for these re- 
source names, it is not desirable to use the actual string. 
The reason is that labels are taken into account during 
the matching process, and two nodes are considered the 
same only when their labels match. Thus, some type of 
abstraction is necessary for labels that represent resource 
names; otherwise, graphs become too specific. We generalize 
file names based on the location of the file (using 
the path name) and its type (typically, based on the file’s 
extension). Registry key names are generalized by nor- 
malizing the key root (using abbreviations) and replac- 
ing random names by a generic format (typically, nu- 
merical values). More details about the labeling process 
and these abstractions can be found in Appendix A. 


Simplifying behavior graphs. One problem we faced 
during the behavior graph generation was that certain 
graphs grew very large (in terms of number of nodes), 
but the extra nodes only carried duplicate information. 
For example, consider a bot that downloads an exe- 
cutable file. When this file is large, the data will not 
be read from the network connection by a single recv 
call. Instead, the receive system call might be invoked 
many times; in fact, we have observed samples that read 
network data one byte at a time. Since every system call 
results in a node being added to the behavior graph, this 
can increase the number of nodes significantly. 

To reduce the number of (essentially duplicate) nodes 
in the graph, we introduce a post-processing step that 
collapses certain nodes. The purpose of this step is 
to combine multiple nodes that share the same label and 
dependencies. More precisely, for each pair of nodes 
with an identical label in the behavior graph, we check 
whether (1) the two nodes share the same set of parent 
nodes, or (2) the sets of parents and children of one node 
are respective subsets of the other, or (3) one node is the 
only parent of the other. If this is the case, we collapse 
these nodes into a single node and add a special tag IsMultiple 
to the label. Additional incoming and outgoing 
edges of the aggregated nodes are merged into the new 
node. The process is repeated until no more collapsing 
is possible. As an example, consider the case where a 
write file operation stores data that was previously read 
from the network by multiple receive calls. In this case, 
the write system call node will have many identical par- 
ent nodes (the receive operations), which all contribute 
to the buffer that is written. In the post-processing step, 
these nodes are all merged into a single system call. A 
beneficial side effect of node collapsing is that it not 
only reduces the number of nodes, but also provides 
some level of abstraction from the concrete implementa- 
tion of the malware code and the number of times iden- 
tical functions are called (as part of a loop, for example). 
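
The three collapsing conditions can be sketched as follows. This is a minimal illustration, not the JACKSTRAWS code: edge labels are omitted for brevity, the IsMultiple tag is tracked in a separate set rather than inside the label, and all names are hypothetical.

```python
from itertools import combinations


def collapse_nodes(labels, edges):
    """labels: node id -> label; edges: set of (parent id, child id)."""
    labels, edges = dict(labels), set(edges)
    multiple = set()  # ids of nodes that stand for several collapsed calls

    def parents(n):
        return {p for (p, c) in edges if c == n}

    def children(n):
        return {c for (p, c) in edges if p == n}

    def mergeable(a, b):
        pa, pb, ca, cb = parents(a), parents(b), children(a), children(b)
        return (
            pa == pb                      # (1) same set of parents
            or (pa <= pb and ca <= cb)    # (2) a's sets contained in b's
            or (pb <= pa and cb <= ca)    # (2) b's sets contained in a's
            or pa == {b} or pb == {a}     # (3) one is the other's only parent
        )

    changed = True
    while changed:  # repeat until no more collapsing is possible
        changed = False
        for a, b in combinations(sorted(labels), 2):
            if labels[a] == labels[b] and mergeable(a, b):
                # Redirect b's edges to a, then drop resulting self-loops.
                edges = {(a if p == b else p, a if c == b else c)
                         for (p, c) in edges}
                edges = {(p, c) for (p, c) in edges if p != c}
                del labels[b]
                multiple.add(a)
                multiple.discard(b)
                changed = True
                break
    return labels, edges, multiple
```

For a connect call followed by three recv calls that all feed one NtWriteFile call, the sketch leaves a single recv node marked as multiple. Each merge removes one node, so the loop always terminates.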


Summary. The output of the two previous steps is one 
behavior graph for each network connection that a mal- 
ware sample makes. Behavior graphs can be used in two 
ways: First, we can match behavior graphs, produced 
by running unknown malware samples, against a set 
of C&C templates that characterize malicious activity. 
When a template matches, the corresponding network 
connection can be labeled as command and control. This 
matching procedure is explained in Section 3.6. 

The second use of behavior graphs is for C&C tem- 
plate generation. For this process, we assume that we 
know some connections that are malicious and some that 
are benign. We can then extract the subgraphs from 
the behavior graphs that are related to known malicious 
C&C connections and subgraphs that represent benign 
activity. These two sets of malicious and benign graphs 
form the input for the template generation process that 
is described in the following three sections. 


3.3 Graph Mining 


The first step when generating C&C templates is graph 
mining. More precisely, the goal is to mine frequent sub- 
graphs that are only present in the malicious set. An 
overview of the process can be seen in Figure 2. 

Figure 2: Mining process. 

Frequent subgraphs are those that appear in more than 
a fraction k of all malicious graphs. When k is too high, 
we might miss many interesting behaviors (subgraphs) 
that are not frequent enough to exceed this threshold. 
When k is too low, more behaviors are covered, but 
unfortunately, the mining process will produce such a 
massive amount of graphs that it never terminates. We 
discuss the concrete choice of k in Section 4. 


Frequent subgraph mining. There exist a number of 
tools that can be readily used for mining frequent sub- 
graphs. For this paper, we decided to use gSpan [47,48] 
because it is stable, and supports labeled graphs, both at 
the node and edge level. gSpan relies on a lexicographic 
ordering of graphs and uses a depth-first search strategy 
to efficiently mine frequent, connected subgraphs. 

A limitation of gSpan is that it only supports undi- 
rected edges, whereas behavior graphs are, by nature, 
directed since the edges represent data flows. To work 
around this limitation and produce directed subgraphs, 
we encode the direction of edges into their labels, and 
then restore the direction information at the end of the 
mining process. Moreover, gSpan accepts only numeric 
values as labels for nodes and edges. Thus, we cannot 
directly use the string labels (names or flags) that are associated 
with nodes and edges in the behavior graphs. 
To solve this, we simply concatenate all string labels of 
a node or edge and hash the result. Then, this hash value 
is mapped into a unique integer. 
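
The label conversion can be sketched as below. This is an illustrative assumption rather than the actual implementation: the class name, the choice of SHA-1, and the NUL separator between concatenated parts (added here to keep distinct label tuples from colliding) are not specified in the text. The direction encoding for edges is not covered by this sketch.

```python
import hashlib


class LabelEncoder:
    """Maps string labels to small integers via a hash of their concatenation."""

    def __init__(self):
        self._ids = {}      # hash digest -> integer label
        self._strings = {}  # integer label -> original strings (for decoding)

    def encode(self, *parts):
        # Concatenate all string labels of a node or edge and hash the result.
        digest = hashlib.sha1("\x00".join(parts).encode("utf-8")).hexdigest()
        if digest not in self._ids:
            # Assign the next unused integer to a digest seen for the first time.
            self._ids[digest] = len(self._ids)
            self._strings[self._ids[digest]] = parts
        return self._ids[digest]

    def decode(self, label_id):
        """Recover the original strings after mining."""
        return self._strings[label_id]
```

Keeping the reverse map allows the string labels to be restored on the mined subgraphs, in the same way that the direction information is restored at the end of the mining process.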


Subgraph maximization. The output produced by 
gSpan contains many graphs that are subgraphs of oth- 
ers. The reason is that gSpan works by growing sub- 
graphs. That is, it first looks for individual nodes that 
are frequent. Then, gSpan adds one additional node and 
re-runs the frequency checks. This add-and-check pro- 
cess is repeated until no more frequent graphs can be 
found. However, during this process, gSpan outputs all 
subgraphs that are frequent. Thus, the result of the min- 
ing step contains all intermediate subgraphs whose fre- 
quency is above the selected threshold. 

Unfortunately, these redundant, intermediate sub- 
graphs negatively affect the subsequent template gener- 
ation steps because they distort the frequencies of nodes 
and edges. To solve this problem, we introduce a max- 
imization step. The purpose of this step is to remove 
a subgraph $G_{sub}$ if there exists a supergraph $G_{super}$ in 
the same result set that contains $G_{sub}$. Looking at Figure 2, 
the result of the maximization step is that all 2-node 
graphs are removed because they are subgraphs of 
the 3-node graphs. However, removing subgraphs is not 
always desirable: even when both a subgraph $G_{sub}$ and 
a supergraph $G_{super}$ exceed the frequency threshold k, 
the subgraph $G_{sub}$ might be much more frequent than 
$G_{super}$. In this case, both graphs should be kept. To 
this end, we only remove a subgraph $G_{sub}$ when its 
frequency is less than twice the frequency of $G_{super}$. 
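
The removal rule can be sketched as follows. To keep the example short, it assumes that each mined pattern is given as a frozenset of labeled edges, so that "is a subgraph of" reduces to proper set inclusion. That simplification, and the function name, are assumptions of this sketch; the real check needs a subgraph isomorphism test.

```python
def maximize(patterns):
    """patterns: dict mapping frozenset-of-edges -> frequency.

    A pattern is dropped only if some proper supergraph in the same result
    set exists and the pattern is less than twice as frequent as it.
    """
    kept = {}
    for sub, f_sub in patterns.items():
        redundant = any(
            sub < sup and f_sub < 2 * f_sup
            for sup, f_sup in patterns.items()
        )
        if not redundant:
            kept[sub] = f_sub
    return kept
```

With frequencies 100, 40, and 35 for a 1-edge, 2-edge, and 3-edge pattern nested in each other, the 2-edge pattern is removed (40 < 2 x 35), while the 1-edge pattern is kept because it is more than twice as frequent as either supergraph.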


Graph sets difference. So far, we have mined graphs 
that frequently appear in the malicious set. However, we 
also require that these graphs do not appear in the benign 
set. Otherwise, they would not be suitable to distinguish 
C&C connections from other traffic. 

To remove graphs that are present in the benign set, 
we compute the set difference between the frequent ma- 
licious subgraphs and benign graphs. More precisely, we 
use a subgraph isomorphism test to determine, for each 
malicious graph, whether it appears in some benign graph. 
If this is the case, it is removed from the mining results. 
Looking at the example in Figure 2, the set difference 
removes one graph that also appears in the benign set. 
As an interesting technical detail, our approach of using 
set difference to obtain interesting, malicious subgraphs 
is different from the technique presented in [13]. In [13], 
the authors use leap mining, which operates simultane- 
ously on the malicious and benign sets to find graphs 
with a high frequency within the malicious set and a low 
frequency within the benign set [46]. 

By construction, leap mining removes all parts from 
the output that are shared between benign and malicious 
graphs. For example, consider a behavior graph that captures 
a command that downloads data, stores it to a file, 
and later executes this file. If the download part of this 
graph is also present in the benign set, which is likely to 
be the case (since downloading data is not malicious per 
se), this part will be removed. Thus, the malicious graph 
will only contain the part where the downloaded file is 
executed. That is, in this example, leap mining would 
produce an incomplete graph that covers only part of the 
relevant, malicious activity. In our case, we first gener- 
ate the entire graph that captures both the download and 
the execute. Then, the set difference algorithm checks 
whether this entire graph occurs also in the benign set. 
Since presumably no benign graph is a supergraph of the 
malicious behavior, the entire graph is retained. 
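
The set difference can be illustrated with a plain backtracking test for non-induced, label-preserving subgraph isomorphism. This is a sketch for small graphs with hypothetical names and graph encoding, not the algorithm used in the actual system; its worst-case running time is exponential.

```python
def is_subgraph(pattern, target):
    """True if `pattern` occurs in `target` (non-induced, label-preserving).

    Each graph is (labels, edges): labels maps node id -> label, and edges
    is a set of (source id, destination id, edge label) triples.
    """
    p_labels, p_edges = pattern
    t_labels, t_edges = target
    order = list(p_labels)

    def consistent(mapping):
        # Every pattern edge with both endpoints mapped must exist in target.
        return all(
            (mapping[s], mapping[d], l) in t_edges
            for (s, d, l) in p_edges
            if s in mapping and d in mapping
        )

    def extend(mapping):
        if len(mapping) == len(order):
            return True
        node = order[len(mapping)]
        used = set(mapping.values())
        for cand, label in t_labels.items():
            if cand in used or label != p_labels[node]:
                continue
            mapping[node] = cand
            if consistent(mapping) and extend(mapping):
                return True
            del mapping[node]
        return False

    return extend({})


def set_difference(mined, benign):
    """Keep mined subgraphs that occur in no benign graph."""
    return [g for g in mined if not any(is_subgraph(g, b) for b in benign)]
```

Applied to the download-and-execute example, a mined graph that contains only the download part is discarded if a benign graph also downloads data to a file, whereas the complete download-and-execute graph survives.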


3.4 Graph Clustering 


Using as input the frequent, malicious subgraphs pro- 
duced by the previous mining step, the purpose of this 
step is to find clusters of similar graphs (see Figure 3). 
The graph mining step produces different graphs that 
represent different types of behaviors. We now need to 
cluster these graphs to find groups, where each group 
shares a common core of activities (system calls) typi- 
cal of a particular behavior. Graph clustering is used for 
this purpose; generated clusters are later used to create 
generalized templates covering the graphs they contain. 

Figure 3: Clustering and generalization processes. 

A crucial component for every clustering algorithm 
is the proper choice of a similarity measure that com- 
putes the distances between graphs. In our system, the 
similarity measure between two graphs is based on their 
non-induced, maximum common subgraph (mcs). The 
mcs of two graphs is the largest graph that is isomorphic 
to a subgraph of each. The intuition behind the measure is the 
following: We expect two graphs that represent the same 
malware behavior to share a common core of nodes that 
capture this behavior. The mcs captures this core. Thus, 
the mcs will be large for similar graphs. From now on, 
all references to the mcs will refer to the non-induced 
construction. The similarity measure is defined as: 


$$\delta(G_1, G_2) = \frac{2 \times |\mathrm{edges}(\mathrm{mcs}(G_1, G_2))|}{|\mathrm{edges}(G_1)| + |\mathrm{edges}(G_2)|} \qquad (1)$$

To compute the mcs between two graphs, we use the 
McGregor backtracking algorithm [6]. According to 
benchmarking results [6], this algorithm performs well 
on randomly-connected graphs with small density. In 
our case, behavior graphs have no cycles and only a 
limited number of dependencies; this is close to 
randomly-connected graphs rather than regular or irregular meshes. 
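
As a worked instance of Equation (1), with sizes made up purely for illustration: suppose $G_1$ has 6 edges, $G_2$ has 4 edges, and their mcs has 3 edges.

```latex
\delta(G_1, G_2) = \frac{2 \times 3}{6 + 4} = 0.6
```

Two identical graphs yield a value of 1, and two graphs without any common edge yield 0.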


As shown in Figure 3, we use the mcs similarity mea- 
sure to compute the one-to-one distance matrix between 
all mined graphs. We then use a tool, called Cluto [24], 
to find clusters of similar graphs. Cluto implements a 
variety of different clustering algorithms; we selected 
clustering by repeated bisection. This algorithm works 
as follows: All graphs are originally put into a single 
cluster. This cluster is then iteratively split until the sim- 
ilarity in each sub-cluster is larger than a given similarity 
threshold [24,50]. At each step, the cluster to be split is 
chosen so that the similarity function between the elements 
of that cluster is maximized. The advantage of 
this technique is that we do not need to define a fixed 
number of clusters a priori. Moreover, one also does not 
need to select initial graphs as centers to build the clusters 
around (as with k-means clustering). The output of this 
step is a set of clusters that contain similar graphs. 


3.5 Graph Generalization and Templating 


Based on the clusters of similar graphs, the final step in 
our template generation process is graph generalization 
(the rightmost step in Figure 3). The goal of the gen- 
eralization process is to construct a template graph that 
abstracts from the individual graphs within a cluster. In- 
tuitively, we would expect that a template contains a core 
of nodes, which are common to all graphs in a cluster. 
In addition, to capture small differences between these 
graphs, there will be optional nodes attached to the core. 


The generalization algorithm computes the weighted 
minimal common supergraph (WMCS) of all the graphs 
within a given cluster [3]. The WMCS is the minimal 
graph such that all the graphs of the cluster are con- 
tained within it. To distinguish between core and op- 
tional nodes, we use weights. These weights capture 
how frequently a node or an edge of the WMCS is present 
in the individual graphs. For core edges and core 
nodes, we expect that they are present in all graphs of a 
cluster (that is, their weight is n in the WMCS, assuming 
that there are n graphs in the cluster). All other nodes 
with a weight smaller than n become optional nodes. 
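
The role of the weights can be illustrated with a deliberately simplified sketch. It assumes that every label occurs at most once per graph, so that nodes can be identified by their label and the common subgraph with the WMCS reduces to the shared labels and edges. This assumption, and the function names, are ours; the actual system computes the mcs with a backtracking search.

```python
from collections import Counter


def weighted_supergraph(graphs):
    """graphs: list of (nodes, edges); nodes is a set of labels and edges a
    set of (source label, destination label) pairs.

    Returns the node and edge weights of the supergraph.
    """
    node_w, edge_w = Counter(), Counter()
    for nodes, edges in graphs:
        node_w.update(nodes)  # +1 for every graph that contains the node
        edge_w.update(edges)  # +1 for every graph that contains the edge
    return node_w, edge_w


def split_core_optional(graphs):
    """Elements with weight n (present in all n graphs) form the core."""
    n = len(graphs)
    node_w, edge_w = weighted_supergraph(graphs)
    core = {v for v, w in node_w.items() if w == n}
    optional = set(node_w) - core
    core_edges = {e for e, w in edge_w.items() if w == n}
    return core, optional, core_edges
```

For a cluster of two graphs that differ in a single extra registry read, that extra node receives weight 1 and becomes optional, while everything shared by both graphs has weight 2 and forms the core.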


The approach to compute a template is presented in 
Algorithm 1. The WMCS is first initialized with the first 
graph G, of the cluster, and the weights of all its nodes 
and edges are set to 1. The integration of an additional 
graph G;; is performed as follows: We first determine 
the maximal common subgraph mcs between G; and the 
current WMC%S. The nodes and edges in the WMCS that 
are part of the mcs have their weight increased by 1. The 
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Algorithm 1 Weighted minimum common supergraph 


Algorithm 2 Template matching 





Require: A graph set G1,...,Gn 

1 WMCS+¢ Gy 

2: Vn € nodes(T) and e € edges(T) dow, := land we := 1 
3: fori = 2tondo 

: §:= state_exploration(@) 

mcs + maximum_common_subgraph(Gi,W MCS, s) 
Vn € nodes(mcs) do wn += 1 
Ve € edges(mcs) do we += 1 
Vn € nodes(G;) and n ¢ nodes(mcs), 
do W MCS.add_node(n) and wy, := 1 

9: Ve € edges(G;) and e ¢ edges(mes), 

do W MCS.add_edge(e) and we := 1 

10: end for 
11: return WMCS 
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nodes and edges in G; that are not part of the mcs are 
added to the WMCS, and their weight is set to 1. 

To increase the generality of a template, the labels of 
optional nodes are further relaxed. More precisely, our 
system preserves the label that stores the name of the 
system call. However, all additional information is re- 
placed by a wild card that matches every possible, con- 
crete parameter later. Finally, we remove all templates 
with a core of three or fewer nodes. The reason is that 
these templates are likely too small to accurately capture 
the entire malicious activity and might lead to false pos- 
itives. In the example in Figure 3, core nodes and edges 
are shown as dark elements, while the optional elements 
are white. We generate one C&C template per cluster. 


3.6 Template Matching 


The previous steps leveraged machine learning tech- 
niques and sets of known malicious and benign graphs to 
produce a number of C&C templates. These templates 
are graphs that represent host-based activity that is re- 
lated to command and control traffic. To find the C&C 
connections for a new malware sample, this sample is 
first executed in the sandbox, and our system extracts its 
behavior graphs (as discussed in Sections 3.1 and 3.2). 
Then, we match all C&C templates against the behavior 
graphs. Whenever a match is found, the corresponding 
connection is detected as command and control traffic, 
and we can extract its endpoints (IPs and domains). 

The matching technique is described in Algorithm 2.
In a first step, we attempt to determine whether the core
of a template T is present in the behavior graph G. To
this end, we simply use a subgraph isomorphism test.
When the test fails, we know that the core nodes of T
are not part of the graph, and we can advance to trying
the next template. If the core is found, we obtain the
mapping from the core nodes to the corresponding nodes
in G. We then test the optional nodes. To this end, we
compute the mcs between T and G. For this, the fixed
mapping provided by the previous isomorphism test is
used to initialize the space exploration when building the
mcs, significantly speeding up the process. Based on the
result of the mcs computation, we can directly see how
many optional nodes have matched, that is to say, are
covered by the mcs. Taking into account the fraction (or
the absolute number) of optional nodes that are found in
G, we can declare a template match.

Algorithm 2 Template matching

Require: A behavior graph G, a template T
  map ← subgraph_isomorphism(core(T), G)
  if map = ∅ then
    return false
  end if
  s := state_exploration(map)
  mcs ← maximum_common_subgraph(G, T, s)
  return true, mcs
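
The two-stage matching can be sketched in Python under the same simplifying assumption of label-identified nodes, so that the subgraph isomorphism test on the core degenerates to a subset check and edges are ignored. The threshold `min_optional_fraction` and its default value are hypothetical; the text only states that a fraction or an absolute number of optional nodes is taken into account.

```python
def match_template(core, optional, graph_nodes, min_optional_fraction=0.5):
    """core, optional, graph_nodes: sets of node labels.

    Returns (matched, covered_optional_nodes).
    """
    # Step 1: core test. A failed test means the template cannot match,
    # and the caller can move on to the next template.
    if not core <= graph_nodes:
        return False, set()
    # Step 2: optional nodes covered by the common part of T and G.
    covered = optional & graph_nodes
    fraction = len(covered) / len(optional) if optional else 1.0
    return fraction >= min_optional_fraction, covered
```

For example, `match_template({"a", "b"}, {"c", "d"}, {"a", "b", "c", "x"})` finds the full core and one of two optional nodes, which meets the assumed threshold of one half.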


4 Evaluation 


Experiments were performed to evaluate JACKSTRAWS 
both from a quantitative and qualitative perspective. 
This section describes the evaluation details and results. 


4.1 Evaluation Datasets 


For the evaluation, our system analyzed a total of 37,572 
malware samples. The samples were provided to us by 
a network security company, who obtained the binaries 
from recent submissions to a public malware analysis
sandbox. Moreover, we were only given samples that 
showed some kind of network activity when run in the 
sandbox. We were also provided with a set of 385 sig- 
natures specifically for known C&C traffic, as well as 
162 signatures that characterize known, benign traffic. 
As mentioned previously, the company uses signatures 
for benign traffic to be able to quickly discard harmless 
connections that bots frequently make. 

To make sure that our sample set covers a wide vari- 
ety of different malware families, we labeled the entire 
set with six different anti-virus engines: Kaspersky, F- 
Secure, BitDefender, McAfee, NOD32, and F-Prot. Using
several sources for labeling allows us to reduce the possible
limitations of a single engine. For every malware
sample, each engine returns a label (unless the sample
is considered benign) from which we extract the malware
family substring. For instance, if one anti-virus
engine classifies a sample as Win32.Koobface.AZ, then 
Koobface is extracted as the family name. The family 
that is returned by a majority of the engines is used to 
label a sample. In case the engines do not agree (and 
there is no majority for a label), we go through the out- 
put of the AV tools in the order that they were mentioned 
previously and pick the first, non-benign result. 
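
The labeling rule can be sketched as follows. Two points are assumptions rather than statements from the text: AV labels are taken to follow a `Platform.Family.Variant` layout, and "majority" is read as more than half of the six engines. The function names are hypothetical.

```python
from collections import Counter

# Engines in the order in which they are listed above.
ENGINE_ORDER = ["Kaspersky", "F-Secure", "BitDefender", "McAfee", "NOD32", "F-Prot"]

def family_of(av_label):
    """'Win32.Koobface.AZ' -> 'Koobface' (assumes Platform.Family.Variant)."""
    parts = av_label.split(".")
    return parts[1] if len(parts) >= 2 else av_label

def label_sample(results):
    """results: dict engine -> AV label, or None when the engine says benign."""
    families = {e: family_of(l) for e, l in results.items() if l is not None}
    if not families:
        return None  # no engine flagged the sample: it stays unlabeled
    family, votes = Counter(families.values()).most_common(1)[0]
    if votes > len(ENGINE_ORDER) / 2:  # assumed meaning of "majority"
        return family
    for engine in ENGINE_ORDER:  # fall back to the first non-benign result
        if engine in families:
            return families[engine]
    return None
```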

Overall, we identified 745 different malware families 
for the entire set. The most prevalent families were 
Generic (3756), EgroupDial (2009), Hotbar (1913), 
Palevo (1556), and Virut (1539). 4,096 samples remained
without label. Note that Generic is not a precise
label; many different kinds of malware can be classified 
as such by AV engines. In summary, the results indicate 
that our sample set has no significant bias towards a cer- 
tain malware family. As expected, it covers a rich and 
diverse set of malware, currently active in the wild. 

In a first step, we executed all samples in JACK- 
STRAWS. Each sample was executed for four minutes, 
which allows a sample to initialize and perform its nor- 
mal operations. This timeout is typically enough to 
establish several network connections and send/receive 
data via them. The execution of the 37,572 samples 
produced 150,030 network connections, each associated 
with a behavior graph. From these graphs, we removed 
19,395 connections in which the server responded with 
an error (e.g., an HTTP request with a 404 “Not Found” 
response). Thus, we used a total of 130,635 graphs pro- 
duced by a total of 33,572 samples for the evaluation. 

In the next step, we applied our signatures to the 
network connections. This resulted in 16,535 connec- 
tions that were labeled as malicious (known C&C traffic, 
12.7%) and 16,082 connections that were identified as 
benign (12.3%). The malicious connections were pro- 
duced by 9,108 samples, while the benign connections 
correspond to 7,031 samples. The remaining 98,018 
connections (75.0%) are unknown. The large fraction of 
unknown connections is an indicator that it is very dif- 
ficult to develop a comprehensive set of signatures that 
cover the majority of bot-related C&C traffic. In partic- 
ular, there was at least one unclassified connection for 
31,671 samples. Note that the numbers of samples that 
produced malicious, benign, and unknown traffic add up 
to more than the total number of samples. This is be- 
cause some samples produced both malicious and be- 
nign connections. This underlines that it is difficult to 
pick the important C&C connections among bot traffic. 
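
The figures in the two paragraphs above can be cross-checked with a few lines of Python. All numbers are taken from the text; only the variable names are introduced here.

```python
total = 150_030 - 19_395  # connections left after dropping error responses
counts = {"malicious": 16_535, "benign": 16_082, "unknown": 98_018}

# The three classes partition the remaining connections.
assert total == 130_635
assert sum(counts.values()) == total

# Percentage of each class, rounded as in the text.
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}
print(shares)  # {'malicious': 12.7, 'benign': 12.3, 'unknown': 75.0}

# Samples per class overlap, so their sum exceeds the 33,572 samples used.
samples_per_class = 9_108 + 7_031 + 31_671
assert samples_per_class > 33_572
```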

Of course, not all of the 385 malicious signatures pro- 
duced matches. In fact, we observed only hits from 78 
C&C signatures, and they were not evenly distributed. 
A closer examination revealed that the signature that
matched the largest number of network connections is related
to Palevo (4,583 matches), followed by Ramnit
(3,896 matches) and Koobface (2,690 matches). 


4.2 Template Generation 


Initially, we put all 16,535 behavior graphs that corre- 
spond to known C&C connections into the malicious 
graphs set, while the 16,082 graphs corresponding to be- 
nign connections were added to the benign graphs set. 
To improve the quality of these sets, we removed graphs 
that contained too few nodes, as well as graphs that 
contained only nodes that correspond to network-related 
system calls (and a few other house-keeping functions 
that are not security-relevant). Moreover, to maintain
a balanced training set, we kept at most three graphs
(connections) for each distinct malware sample. This 
pre-processing step reduced the number of graphs in the 
malicious set to 10,801, and to 12,367 in the benign set. 
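
A minimal sketch of this pre-processing step is given below. The text does not state the minimum graph size or how the three graphs per sample are chosen, so `min_nodes` and the random selection are assumptions, as are the field names of the graph records.

```python
import random
from collections import defaultdict

def preprocess(graphs, min_nodes=3, max_per_sample=3, seed=0):
    """graphs: list of dicts with keys 'sample' (id), 'syscalls' (list of
    call names) and 'network_only' (bool). All field names are hypothetical."""
    by_sample = defaultdict(list)
    for g in graphs:
        # Drop graphs that are too small or contain only network-related calls.
        if len(g["syscalls"]) < min_nodes or g["network_only"]:
            continue
        by_sample[g["sample"]].append(g)
    rng = random.Random(seed)
    kept = []
    for sample_graphs in by_sample.values():
        # Keep at most max_per_sample graphs for each distinct sample.
        if len(sample_graphs) > max_per_sample:
            sample_graphs = rng.sample(sample_graphs, max_per_sample)
        kept.extend(sample_graphs)
    return kept
```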

Both sets were then further split into a training set and 
a test set. To this end, we randomly picked a number 
of graphs for the training set, while the remaining ones 
were set aside as a test set. More precisely, for the mali- 
cious graphs, we kept 6,539 graphs (60.5%) for training 
and put 4,262 graphs (39.5%) into the test set. For the 
benign graphs, we kept 8,267 graphs (66.8%) for train- 
ing and put 4,100 graphs (33.2%) into the test set. We 
used these malicious and benign training sets as input 
for our template generation algorithm. This resulted in 
417 C&C templates that JACKSTRAWS produced. The 
average number of nodes in a template was 11, where 6 
nodes were part of the core and 5 were optional. 

For the mining process, we used a threshold k = 0.1.
That is, the mining tool will pick subgraphs from the 
training sets only when they appear in more than 10% 
of all behavior graphs. The reason why we could oper- 
ate with a relatively large threshold of k = 0.1 is that 
we divided the behavior graphs into different bins, and 
mined on each bin individually. To divide graphs into 
bins, we observe that certain malware activity requires 
the execution of a particular set of system calls. For ex- 
ample, to start a new process, the malware needs to call 
NtCreateProcess, or to write to a file, it needs to 
call NtWriteFile. Thus, we selected five security- 
relevant system activities (registry access; file system 
access; process creation; queries to system information; 
and accesses to web-related resources, such as HTML or 
JS files) and assigned each to a different bin. Then, we 
put into each bin all behavior graphs that contain a node 
with the corresponding activity (system calls). Graphs 
that did not fall into any of these bins were gathered in 
a miscellaneous bin. It is important to observe that this 
step merely allows us to mine with a higher threshold, 
and thus to accelerate the graph mining process consid- 
erably. We would have obtained the same set of tem- 
plates (and possibly more) when mining on the entire 
training set with a lower mining threshold. 
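
The binning step can be sketched as follows. Only `NtCreateProcess` and `NtWriteFile` are named in the text; the other system calls and the bin names are illustrative assumptions, and only three of the five activity bins are shown.

```python
# Hypothetical mapping from bin name to the system calls that indicate it.
BIN_CALLS = {
    "process": {"NtCreateProcess"},
    "file": {"NtWriteFile", "NtCreateFile"},
    "registry": {"NtOpenKey", "NtQueryValueKey"},
}

def assign_bins(graph_calls):
    """graph_calls: set of system call names found in one behavior graph.

    A graph is placed in every bin whose activity it contains; graphs that
    fit no bin go to the miscellaneous bin.
    """
    bins = [name for name, calls in BIN_CALLS.items() if graph_calls & calls]
    return bins or ["miscellaneous"]
```

Mining is then run per bin, and a subgraph is kept only if it occurs in more than a fraction k of the graphs of that bin.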

For the clustering process, we iterated the bisection 
operation until the average similarity within the clus- 
ters was over 60% and the minimal similarity was over 
40%. Higher thresholds were discarded because they in- 
creased the number of clusters, making them too spe- 
cific. 
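
The stopping criterion of the bisection can be written as a small predicate. The similarity measure itself is passed in as a parameter because it is defined elsewhere in the paper; the function name is hypothetical, and any concrete measure plugged in here (such as a Jaccard index over node sets) would only be a stand-in.

```python
from itertools import combinations

def is_tight(cluster, similarity, min_avg=0.6, min_pair=0.4):
    """True when a cluster needs no further bisection: its average pairwise
    similarity exceeds min_avg and its lowest pairwise similarity exceeds
    min_pair."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return True  # a single graph is trivially tight
    sims = [similarity(a, b) for a, b in pairs]
    return sum(sims) / len(sims) > min_avg and min(sims) > min_pair
```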

Producing templates for the 14,806 graphs in the
training set took about 21 hours on a server with four
2.67 GHz Intel Xeon CPUs and 16 GB of RAM.
This time was divided into 16 hours for graph mining,
4.5 hours for clustering, and 30 minutes for graph
generalization. This underlines that, despite the potentially
costly (NP-hard) graph algorithms, JACKSTRAWS is able
to efficiently produce results on a large, real-world input
dataset. The mining process was the most time-consuming
operation, but the number of mined subgraphs
was, in the end, five times smaller than the number
of input graphs. Consequently, the clustering process,
which is polynomial in the number of input graphs,
ran on a reduced set. For the template
generation process, the resulting clusters only contained
10 to 20 graphs on average, explaining the faster
computations.


4.3 Detection Accuracy 


In the next step, we wanted to assess whether the gen- 
erated templates can accurately detect activity related 
to command and control traffic without matching be- 
nign connections. To this end, we ran two experiments. 
First, we evaluated the templates on the graphs in the 
test set (which correspond to known C&C connections). 
Then, we applied the templates to graphs associated 
with unknown connections. This allows us to deter- 
mine whether the extracted C&C templates are generic 
enough to allow detection of previously-unknown C&C 
traffic (for which no signature exists). 


Experiment 1: Known C&C connections. For the first 
experiment, we made use of the test set that was pre- 
viously set aside. More precisely, we applied our 417 
templates to the behavior graphs in the test set. This test 
set contained 4,262 connections that matched C&C signatures
and 4,100 benign connections.

Our results show that JACKSTRAWS is able to success- 
fully detect 3,476 of the 4,262 malicious connections 
(81.6%) as command and control traffic. Interestingly, 
the test set also contained malware families that were 
absent from the malicious training set. 51.7% of the 
malicious connections coming from these families were 
successfully detected, accounting for 0.4% of all detec- 
tions. While the detection accuracy is high, we explored 
false negatives (i.e., missed detections) in more detail. 
Overall, we found three reasons why certain connections 
were not correctly identified: 

First, in about half of the cases, detection failed be- 
cause the bot did not complete its malicious action after 
it received data from the C&C server. Incomplete be- 
havior graphs can be due to a timeout of the dynamic 
analysis environment, or to a host configuration that
prevents the received command from executing properly.

Second, the test set contained a significant number of 
Adware samples. The behavior graphs extracted from 
these samples are very similar to benign graphs; after 
all, Adware is in a grey area different from malicious 
bots. Thus, all graphs potentially covering these sam- 
ples are removed at the end of the mining process, when 
compared to the benign training sets.

The third reason for missed detections is malicious
connections that are only seen a few times (possibly only 
in the test set). According to the AV labels, our data 
set covers 745 families (and an additional 4,096 samples 
that could not be labeled). Thus, certain families are rare 
in the data set. When a specific graph is only present a 
few times (or not at all) in the training set, it is possible 
that all of its subgraphs are below the mining threshold. 
In this case, we do not have a template that covers this 
activity. 

JACKSTRAWS also reported 7 benign graphs as ma- 
licious out of 4,100 connections in the benign test set: 
a false positive rate of 0.2%. Upon closer examination, 
these false positives correspond to large graphs where 
some Internet caching activity is observed. These graphs 
accidentally triggered four weaker templates with few 
core and many optional nodes. 
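
The detection and false positive rates reported for the first experiment follow directly from the counts given in the text, as this short check shows. The variable names are introduced here for illustration.

```python
detected, malicious_test = 3_476, 4_262
false_alarms, benign_test = 7, 4_100

# Share of known C&C connections in the test set that a template matched.
detection_rate = 100 * detected / malicious_test
# Share of benign test connections wrongly reported as C&C.
false_positive_rate = 100 * false_alarms / benign_test

print(round(detection_rate, 1), round(false_positive_rate, 1))  # 81.6 0.2
```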

Overall, our results demonstrate that the host-based 
activity learned from a set of known C&C connections is 
successful in detecting other C&C connections that were
produced by the same set of malware families, but also
in detecting five related families that were only present
in the test set. In a sense, this shows that C&C templates
have a detection capability similar to that of manually
generated, network-based signatures.

We also wanted to understand the impact of template
generalization compared to previous work that directly
used the mined subgraphs [13]. For this, we used the
graphs mined from the malicious training set as signatures,
without any generalization (this is the approach
followed in previous work). Using a subgraph isomorphism
test for detection over the 4,262 malicious graphs in the
test set, we found that the detection rate was 66%, 15.6
percentage points lower. This underlines that the novel
template generation process provides significant benefits.


Experiment 2: Unknown connections. For the next 
experiment, we decided to apply our templates to the 
graphs that correspond to unknown network traffic. This 
should demonstrate the ability of JACKSTRAWS to detect 
novel C&C connections within protocols not covered by 
any network-level signature. 

When applying our templates to the 98,018 unknown 
connections, we found 9,464 matches (9.7%). We manu- 
ally examined these connections in more detail to deter- 
mine whether the detection results are meaningful. The 
analysis showed that our approach is promising; the vast 
majority of connections that we analyzed had clear indi- 
cations of C&C activity. With the help of the anti-virus 
labels, we could identify 193 malware families which 
were not covered by the network signatures. The most 
prevalent new families were Hotbar (1984), Pakes (871), 
Kazy (107), and LdPinch (67). Furthermore, we de- 
tected several new variants of known bots that we did 
not detect previously because their network fingerprint
had changed and, thus, none of our signatures matched.
Nevertheless, JACKSTRAWS was able to identify these 
connections due to matched templates. In addition, the 
manual analysis showed a low number of false positives. 
In fact, we only found 27 false positives out of the 9,464 
matches, all of them being HTTP connections. 

When comparing the number of our matches with the 
total number of unknown connections, the results may 
appear low at first glance. However, not all connec- 
tions in the unknown set are malicious. In fact, 10,524 
connections (10.7%) do not result in any relevant host
activity at all (the graphs only contain network-related
system calls such as send or connect). For another
13,676 graphs (14.0%), the remote server did not
send any data. For more than 7,360 HTTP connec- 
tions (7.5%), the server responded with status code 302, 
meaning that the requested content had moved. In this 
case, we probably cannot see any interesting behavior 
to match. In a few hundred cases, we also observed that 
the timeout of JACKSTRAWS interrupted the analysis too 
early (e.g., the connection downloaded a large file). In 
these cases, we usually miss some of the interesting be- 
havior. Thus, almost 30 thousand unknown connections 
can be immediately discarded as non-C&C traffic. 

Furthermore, the detection results of 9,464 new C&C 
connections for JACKSTRAWS need to be compared with 
the total number of 16,535 connections that the entire 
signature set was able to detect. Our generalized templates
were able to detect almost 60% more connections
than hundreds of hand-crafted signatures. Note
that our C&C templates do not inspect network traffic at 
all. Thus, they can, by construction, detect C&C connec- 
tions regardless of whether the malware uses encryption 
or not, something not possible with network signatures. 
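
The ratios quoted for the second experiment can be recomputed from the counts in the text. The variable names are introduced here; the interpretation of "almost 60% more" as new matches relative to signature hits is the reading assumed by this check.

```python
unknown, matches, false_positives = 98_018, 9_464, 27
signature_hits = 16_535

match_share = 100 * matches / unknown       # matched share of unknown connections
fp_share = 100 * false_positives / matches  # false positives among the matches
extra = 100 * matches / signature_hits      # new matches relative to signature hits

print(round(match_share, 1), round(fp_share, 2), round(extra, 1))  # 9.7 0.29 57.2
```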


4.4 Template Quality 


The previous section has shown that our C&C templates 
are successful in identifying host-based activity related 
to both known and novel network connections. We also 
manually examined several templates in more detail to 
determine whether they capture activity that a human an- 
alyst would consider malicious. 

JACKSTRAWS was able to extract different kinds of 
templates. A few template examples are shown in Ap- 
pendix B. More precisely, out of the 417 templates, 
more than a hundred templates represent different forms 
of information leakage. The leaked information is origi- 
nally collected from dedicated registry keys or from spe- 
cific system calls (e.g., computer name, Windows ver- 
sion and identifier, Internet Explorer version, current 
system time, volume ID of the hard disk, or processor 
information). About fifty templates represent executable 
file downloads or updates of existing files. Additional 
templates include process execution: downloaded data
that is injected into a process and then executed. Five
templates also represent complete download and exe- 
cute commands. The remaining templates cover vari- 
ous other malicious activities, including registry modi- 
fications ensuring that the sample is started on certain 
events (e.g., replacing the default executable file handler 
for Windows Explorer) and for hiding malware activity 
(e.g., clearing the MUICache). 

We also found 20 “weak” templates (out of 417). 
These templates contain a small number of nodes and 
do not seem related to any obvious malicious activity. 
However, these templates did not trigger any false pos- 
itive in the benign test set. This indicates that they still 
exhibit enough discriminative power with regard to our
malicious and benign graph sets. 


5 Related Work 


Given the importance and prevalence of malware, it is 
not surprising that there exists a large body of work 
on techniques to detect and analyze this class of soft- 
ware. The different techniques can be broadly divided 
into host-based and network-based approaches, and we 
briefly describe the related work in the following. 


Host-based detection. Host-based detection techniques 
include systems such as traditional anti-virus tools that 
examine programs for the presence of known mal- 
ware. Other techniques work by monitoring the execu- 
tion of a process for behaviors (e.g., patterns of system 
calls [12, 28, 32]) that indicate malicious activity. Host- 
based approaches have the advantage that they can col- 
lect a wealth of detailed information about a program 
and its execution. Unfortunately, collecting a lot of in- 
formation comes with a price; it incurs a significant per- 
formance penalty. Thus, detailed but costly monitoring 
is typically reserved for malware analysis, while detec- 
tion systems, which are deployed on end-user machines, 
resort to fast but imprecise techniques [43]. As a result, 
current anti-virus products show poor detection rates [4]. 

A suitable technique to model the host-based activ- 
ity of a program is a behavior graph. This approach 
has been successfully used in the past [5, 13, 26] and 
we also apply this technique. Recently, Fredrikson et 
al. introduced an approach to use graph mining on be- 
havior graphs in order to distinguish between malicious 
and benign programs [13]. Graph mining itself is a well-known
technique [46-48] that we use as a building block
of JACKSTRAWS. Compared to their work, we have a
different high-level goal: we want to learn which network
connections are related to C&C traffic in an automated 
way. Thus we do not only focus on host-level activities, 
but also take the network-level view into account and 
correlate both. Furthermore, we also cluster the graphs 
and perform a generalization step to extract templates 
that describe the characteristics of C&C connections.
From a technical point of view, we perform a more fine-grained
analysis by applying taint analysis instead of the
coarse-grained analysis performed by [13].

BOTSWAT [41] analyzes how bots process network 
data by analyzing system calls and performing taint 
analysis. The system matches the observed behavior 
against a set of 18 manually generated behavior patterns. 
In contrast, we use mining and machine learning tech- 
niques to automatically generate C&C templates. From 
a technical point of view, BOTSWAT uses library-call- 
level taint analysis and, thus, might miss certain dependencies.
In contrast, the data flow analysis support of
JACKSTRAWS enables a finer-grained analysis of
information flow dependencies among system calls.


Network-based detection. To complement host-based 
systems and to provide an additional layer for defense- 
in-depth, researchers proposed network-based detection 
techniques [15-18, 45, 49]. Network-based approaches
have the advantage that they can cover a large num- 
ber of hosts without requiring these hosts to install any 
software. This makes deployment easier and incurs no 
performance penalty for end users. On the downside, 
network-based techniques have a more limited view 
(they can only examine network traffic and encryption 
makes detection challenging), and they do not work for 
malicious code that does not produce any network traffic 
(which is rarely the case for modern malware). 

Initially, network-based detectors focused on the ar- 
tifacts produced by worms that spread autonomously 
through the Internet. Researchers proposed techniques 
to automatically generate payload-based signatures that 
match the exploits that worms use to compromise remote 
hosts [25, 27, 29, 31, 39]. With the advent of botnets, malware
authors changed their modus operandi. In fact, bots
rarely propagate by scanning for and exploiting vulnera- 
ble machines; instead, they are distributed through drive- 
by download exploits [36], spam emails [22], or file 
sharing networks [23]. However, bots do need to com- 
municate with a command and control infrastructure. 
The reason is that bots need to receive commands and 
updates from their controller, and also upload stolen data 
and status information. As a result, researchers shifted 
their efforts to developing ways that can detect and dis- 
rupt malicious traffic between bots and their C&C infras- 
tructure. In particular, researchers proposed approaches 
to identify (and subsequently block) the IP addresses and 
domains that host C&C infrastructures [42], techniques 
to generate payload signatures that match C&C connections
[15, 17, 45], and anomaly-based systems to correlate
network flows that exhibit a behavior characteristic
of C&C traffic [16, 18, 49]. In a paper related to ours,
Perdisci et al. studied how network traces of malware 
can be clustered to identify families of bots that perform 
similar C&C communication [34]. The clustering results
can be used to generate signatures, but their approach
does not take into account that bots generate benign traf- 
fic or can even deliberately inject noise [1, 10, 11, 33]. 
Our work is orthogonal to this approach since we can 
precisely identify connections related to C&C traffic. 


6 Limitations 


We aim at analyzing malicious software, which is a hard 
task in itself. An attacker can use different techniques to 
interfere with the analysis environment, which is of concern
for us. Our approach relies on actually observing
the network communication of the sample to build the 
corresponding behavior graph. Thus, we need to con- 
sider attacks against the dynamic analysis environment, 
and, specifically, the taint analysis, since this component 
allows us to analyze the interdependence of network and 
host activities. Several techniques have been introduced 
in the past to enhance the analysis capabilities, for ex- 
ample, multi-path execution [30] or the analysis of VM- 
aware samples [2]. These and similar methods can be 
integrated in JACKSTRAWS so that the dynamic analysis 
process produces more extensive analysis reports. Note, 
however, that the evaluation results demonstrate that we
can successfully, and at a large scale, analyze complex,
real-world malware samples. This indicates that the prototype
version of JACKSTRAWS already provides a robust
framework for performing our analysis.

Of course, an attacker might develop techniques to 
thwart our analysis, for example, by interleaving unnec- 
essary system calls with the calls that represent the ac- 
tual, malicious activity. The resulting, additional nodes 
might hinder the mining process and prevent the extrac- 
tion of a graph core. An attacker might also try to in- 
troduce duplicate nodes to launch complexity attacks, 
since most of the graph algorithms used in JACKSTRAWS 
are known to be NP-complete [6]. However, interleaved 
calls have to share some data dependencies with relevant
system calls; otherwise, they would be stripped from
the behavior graph. Moreover, they must be specifically
crafted to escape the collapsing mechanism. Another approach
to disturb the analysis is to mutate the sequence
of system calls that implement a behavior, as discussed
in [21]. A possible solution to this kind of attack is to
normalize the input behavior graphs using rewriting
techniques. That is, semantically equivalent graph patterns
are rewritten into a canonical form before mining.


7 Conclusion 


In this paper, we focused on the problem of identifying 
actual C&C traffic when analyzing binary samples. During
a dynamic analysis run, bots do not only communicate
with their C&C infrastructure, but they often also
open a large number of benign network connections. We
introduced JACKSTRAWS, a tool that can identify C&C
traffic in a sound way. This is achieved by correlating
network traffic with the associated host behavior. 

With the help of experiments, we demonstrated the 
different templates we extracted and showed that we 
can even infer information about unknown bot families 
which we did not recognize before. On the one hand, we 
showed that our approach can be applied to proprietary 
protocols, which demonstrates that it is protocol agnos- 
tic. On the other hand, we also applied JACKSTRAWS 
to HTTP traffic, which is challenging since we need to 
reason about small differences between legitimate and 
malicious usage of the Windows API. The results show 
that we can still extract precise templates in this case.


A Graph Labeling and Abstraction


Nodes and edges that are inserted into the behavior 
graph are augmented with additional labels that capture 
more information about the nature of the system calls 
and the dependencies between nodes. In the following, 
we describe this labeling in greater detail. For edges,
a label stores the names of the input and output arguments,
respectively, of the system calls that are connected
through a data dependency. In case of a node,
the label stores the system call name and some optional 
information that depends on the specific type of call. 

As shown in Table 1, the additional information can 
correspond to the type of the resource (files, registry 
keys, ...) that a system call operates on as well as flags 
(such as mode or permission bits for file operations). 
Note that some information is only stored as comments; 
this information is ignored for the template generation 
and matching, but is saved for a human analyst who 
might want to examine a template. 











Operations  File                        Registry     Network

Label       Location, Type (Table 2),   Key name,    Port
            Access, Attributes,         Value name
            CreateDisposition

Comment     File name                                IP address














Table 1: Selected information for labels and comments. 


One important additional piece of information stored 
for system calls that manipulate files and registry keys is 
the name of these files and keys. However, for these re- 
source names, it is not desirable to use the actual string. 
The reason is that labels are taken into account during 
the matching process, and two nodes are considered the 
same only when their labels match. Thus, some type of 
abstraction is necessary for labels that represent resource 
names; otherwise, graphs become too specific.

In the case of files, the name string is split into three 
parts: the path representing the location of the file, the 
short name of the file and its extension. Table 2 shows 
how the paths, short names and extensions are mapped 
to several generic classes of location and type, that are 
then used for the file name label. Similarly, the registry 
key names are split into two parts: the location of the key 
and its short name. The location is first normalized using 
the standard registry abbreviations (HKLM, HKU, HKCU, 
HKCR). The short key name is then compared against generic
types (number, path, url). If the name does not comply
with any format, but still shows a high number of similar
close variations, a generic type random is assigned.
Additional examples of this abstraction process can be
observed in the template examples in the next section.
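
The file name abstraction can be sketched in Python with glob-style patterns. The sketch covers only part of Table 2, and several choices are assumptions made for illustration: matching is case-insensitive, more specific path patterns are tried first, and `None` stands for "no class applies". The function name is hypothetical.

```python
import ntpath
from fnmatch import fnmatchcase

# Some rows of Table 2, most specific path patterns first (assumed ordering).
LOCATIONS = [
    ("InSystemDirectory", "\\windows\\system*\\*"),
    ("InWindowsDirectory", "\\windows\\*"),
    ("InStartupDirectory", "\\documents and settings\\*\\startup\\*"),
    ("InTemporaryDirectory", "\\documents and settings\\*\\local settings\\temp\\*"),
    ("InDocumentDirectory", "\\documents and settings\\*"),
    ("InProgramDirectory", "\\program files\\*"),
]
TYPES = {
    ".exe": "IsExecutable", ".dll": "IsDynamicLibrary",
    ".sys": "IsDriver", ".drv": "IsDriver",
    ".ini": "IsConfiguration", ".cfg": "IsConfiguration",
    ".htm": "IsWebPage", ".php": "IsWebPage", ".xml": "IsWebPage",
    ".js": "IsScript", ".vbs": "IsScript",
}

def abstract_file_name(path):
    """Return (location class, type class) for a Windows file path."""
    # Drop the drive letter and normalize case before matching.
    _, tail = ntpath.splitdrive(path.lower())
    location = next(
        (name for name, pattern in LOCATIONS if fnmatchcase(tail, pattern)), None
    )
    file_type = TYPES.get(ntpath.splitext(tail)[1])
    return location, file_type
```

With this abstraction, two bots that drop differently named executables into the temporary directory produce the same node label.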


B Template Examples


We manually examined C&C templates to determine 
whether they capture activity that a human analyst would 
consider malicious. We now present two examples that 
were automatically generated by JACKSTRAWS. 

Figure 4 shows a template we extracted from bots
that use a proprietary, binary protocol for communicating
with the C&C server. The behavior corresponds to
some kind of information leakage: the samples query the
registry for the computer name and send this information
via the network to a server. We consider this a malicious
activity, which is often used by bots to generate a unique
identifier for an infected machine. In the network traffic
itself this activity cannot be easily identified, since the
samples use their own protocol.


Location               File Path
InWindowsDirectory     \Windows\
InSystemDirectory      \Windows\System*\
InDocumentDirectory    \Documents and Settings\
InStartupDirectory     \Documents and Settings\*\Startup\
InTemporaryDirectory   \Documents and Settings\*\Local Settings\Temp\
InInternetDirectory    \Documents and Settings\*\Local Settings\Temporary Internet Files\
InProgramDirectory     \Program Files\

Type                   Extension
IsExecutable           *.exe
IsDynamicLibrary       *.dll
IsDriver               *.sys, *.drv
IsConfiguration        *.ini, *.cfg
IsWebPage              *.htm, *.php, *.xml
IsScript               *.js, *.vbs
IsCookie               \Cookies\*@*.txt
IsDevice               \Device\
IsNetworkDevice        \Device\AfdEndPoint


Table 2: File abstraction based on location and type.


Figure 5: Template that describes the download and execute functionality of a bot: an executable file is created, its
content is downloaded from the network, decoded, written to disk, its information is modified before being executed.
In the NtCreateFile node, the file name ldr.exe is only mentioned as a comment. Comments help a human analyst
when looking at a template, but they are ignored by the matching.
(Graph not reproduced; recoverable node labels include systemcall: NtAllocateVirtualMemory and
systemcall: NtDeviceIoControlFile, with edge labels such as FileHandle=FileHandle, Socket=Socket, and Buffer=buf.)

As another example, consider the template shown in
Figure 5. This template corresponds to the download &
execute behavior, i.e., data is downloaded from the network,
written to disk, and then executed. The template
describes this specific behavior in a generic way.





Figure 4: Template that describes leaking of sensitive
data. Darker nodes constitute the template core, whereas
lighter ones are optional. (Graph not reproduced; recoverable
edge labels: ObjectAttributes=KeyHandle, KeyHandle=KeyHandle,
buf=KeyValueInformation, Socket=Socket.)


Abstract


In this paper, we present Telex, a new approach to 
resisting state-level Internet censorship. Rather than at- 
tempting to win the cat-and-mouse game of finding open 
proxies, we leverage censors’ unwillingness to completely 
block day-to-day Internet access. In effect, Telex converts 
innocuous, unblocked websites into proxies, without their 
explicit collaboration. We envision that friendly ISPs 
would deploy Telex stations on paths between censors’ 
networks and popular, uncensored Internet destinations. 
Telex stations would monitor seemingly innocuous flows 
for a special “tag” and transparently divert them to a for- 
bidden website or service instead. We propose a new 
cryptographic scheme based on elliptic curves for tagging 
TLS handshakes such that the tag is visible to a Telex 
station but not to a censor. In addition, we use our tagging 
scheme to build a protocol that allows clients to connect 
to Telex stations while resisting both passive and active at- 
tacks. We also present a proof-of-concept implementation 
that demonstrates the feasibility of our system. 


1 Introduction 


The events of the Arab Spring have vividly demonstrated 
the Internet’s power to catalyze social change through 
the free exchange of ideas, news, and other information. 
The Internet poses such an existential threat to repressive 
regimes that some have completely disconnected from 
the global network during periods of intense political un- 
rest, and many regimes are pursuing aggressive programs 
of Internet censorship using increasingly sophisticated 
techniques. 

Today, the most widely-used tools for circumventing
Internet censorship take the form of encrypted tunnels
and proxies, such as Dynaweb [12], Ultrasurf [30], and
Tor [10]. While these designs can be quite effective at
sneaking client connections past the censor, these systems
inevitably lead to a cat-and-mouse game in which the
censor attempts to discover and block the services’ IP
addresses. For example, Tor has recently observed the 
blocking of entry nodes and directory servers in China 
and Iran [28]. Though Tor is used to skirt Internet censors 
in these countries, it was not originally designed for that 
application. While it may certainly achieve its original 
goal of anonymity for its users, it appears that Tor and 
proxies like it are ultimately not enough to circumvent 
aggressive censorship. 

To overcome this problem, we propose Telex: an “end-to-middle”
proxy with no IP address, located within the
network infrastructure. Clients invoke the proxy by using 
public-key steganography to “tag” otherwise ordinary 
TLS sessions destined for uncensored websites. Its design 
is unique in several respects: 


Architecture Previous designs have assumed that anti- 
censorship services would be provided by hosts at the 
edge of the network, as the end-to-end principle requires. 
We propose instead to provide these services in the core 
infrastructure of the Internet, along paths between the 
censor’s network and popular, nonblocked destinations. 
We argue that this will provide both lower latency and 
increased resistance to blocking. 


Deployment Many systems attempt to combat state- 
level censorship using resources provided primarily by 
volunteers. Instead, we investigate a government-scale 
response based on the view that state-level censorship 
needs to be combated by state-level anticensorship. 


Construction We show how a technique that the security 
and privacy literature most frequently associates with 
government surveillance—deep-packet inspection—can 
provide the foundation for a robust anticensorship system. 


We expect that these design choices will be somewhat
controversial, and we hope that they will lead to discussion
about the future development of anticensorship systems.


Contributions and roadmap We propose using “end-to-middle”
proxies built into the Internet’s network infrastructure
as a novel approach to resisting state-level
censorship. We elaborate on this concept and sketch the 
design of our system in Section 2, and we discuss its 
relation to previous work in Section 3. 

We develop a new steganographic tagging scheme 
based on elliptic curve cryptography, and we use it to 
construct a modified version of the TLS protocol that 
allows clients to connect to our proxy. We describe the 
tagging scheme in Section 4 and the protocol in Section 5. 
We analyze the protocol’s security in Section 6. 

We present a proof-of-concept implementation of our 
approach and protocols, and we support its feasibility 
through laboratory experiments and real-world tests. We 
describe our implementation in Section 7, and we evaluate 
its performance in Section 8. 


Online resources For the most recent version of this 
paper, prototype source code, and a live demonstration, 
visit https://telex.cc. 


2 Concept 


Telex operates as what we term an “end-to-middle” proxy. 
Whereas in traditional end-to-end proxying the client con- 
nects to a server that relays data to a specified host, in 
end-to-middle proxying an intermediary along the path 
to a server redirects part of the connection payload to 
an alternative destination. One example of this mode of 
operation is Tor’s leaky-pipe circuit topology [10] fea- 
ture, which allows traffic to exit from the middle of a 
constructed Tor circuit rather than the end. 

The Telex concept is to build end-to-middle proxying 
capabilities into the Internet’s routing infrastructure. This 
would let clients invoke proxying by establishing connec- 
tions to normal, pre-existing servers. By applying this 
idea to a widely used encrypted transport, such as TLS, 
and carefully avoiding observable deviations from the 
behavior of nonproxied connections, we can construct a 
service that allows users to robustly bypass network-level 
censorship without being detected. 

In the remainder of this section, we define a threat 
model and goals for the Telex system. We then give a 
sketch of the design and discuss several practical consid- 
erations. 


2.1 Threat model 


Our adversary, “the censor,” is a repressive state-level authority
that desires to inhibit online access to information
and communication of certain ideas. These desires are
realized by IP and DNS blacklists as well as heuristics for
blocking connections based on their observed content.

We note that the censor has some motivation for connecting
to the Internet at all, such as the economic and
social benefits of connectivity. Thus, the censor bears 
some cost from over-blocking. We assume that the cen- 
sor follows a blacklist approach rather than a whitelist 
approach in blocking, allowing traffic to pass through 
unchanged unless it is explicitly banned. 


Furthermore, we assume that the censor generally per- 
mits widespread cryptographic protocols, such as TLS, ex- 
cept when it has reason to believe a particular connection 
is being used for skirting censorship. We further assume 
that the censor is not subverting such protocols on a wide 
scale, such as by requiring a cryptographic backdoor or 
by issuing false TLS certificates using a country-wide CA. 
We believe this is reasonable, as blocking or subverting 
TLS on a wide scale would render most modern websites 
unusably insecure. Subversion in particular would result 
in an increased risk of large-scale fraud if the back door 
were compromised or abused by corrupt insiders. 


The censor controls the infrastructure of the network 
within its jurisdiction (“the censor’s network”), and it
can potentially monitor, block, alter, and inject traffic 
anywhere within this region. However, these abilities 
are subject to realistic technical, economic, and political 
constraints. 


In general, the censor does not control end hosts within 
its network, which operate under the direction of their 
users. We believe this assumption is reasonable based 
on the failure of recent attempts by national governments 
to mandate client-side filtering software, such as China’s 
Green Dam Youth Escort [33]. The censor might target 
a small subset of users and seize control of their devices, 
either through overt compulsion or covert technical at- 
tacks. Protecting these users is beyond the scope of our 
system. However, the censor’s targeting users on a wide 
scale might have unacceptable political costs. 


The censor has very limited abilities outside its network. 
It does not control any external network infrastructure or 
any popular external websites the client may use when 
communicating with Telex stations. The censor can, of 
course, buy or rent hosting outside its network, but its 
use is largely subject to the policies of the provider and 
jurisdiction. 

Some governments may choose to deny their citizens 
Internet connectivity altogether, or disconnect entirely 
in times of crisis. These cases are outside our threat model;
the best responses to censors like these likely involve
different approaches from ours and entail much steeper
performance trade-offs. Instead, our goal is to make access
to any part of the global Internet sufficient to access
every part of it. In other words, we aim to make connect- 
ing to the global Internet an all-or-nothing proposition for 
national governments.


Figure 1: Telex Concept. This figure shows an example user connecting to Telex. The client makes a tagged
connection to NotBlocked.com, which is passed by the censor’s filter. When the request reaches a friendly on-path
ISP, one of the ISP’s routers forwards the request to the Telex station connected to its tap interface. Telex deciphers
the tag, instructs the router to block the connection to NotBlocked.com, and diverts the connection to Blocked.com,
as the user secretly requested. If the connection were not tagged, Telex would not intervene, and it would proceed to
NotBlocked.com as normal.


2.2 Goals 


Telex should satisfy the following properties: 


Unblockable The censor should not be able to deny 
service to Telex without incurring unacceptable costs. In 
particular, we require that the censor cannot block Telex 
without blocking a large, primarily legitimate category of 
Internet traffic. 


Confidential The censor should not be able to deter- 
mine whether a user is using Telex or what content the 
user is accessing through the system. 


Easy to deploy The consequences of system failure 
(or even normal operation) must not interfere with normal 
network operation (e.g., non-Telex connections) in order 
for deployment to be palatable to ISPs. 


Transparent to users Using Telex should, possibly 
after a small startup procedure, closely resemble using an 
unfiltered Internet connection. 


2.3 Design 


To meet our goals and the constraints imposed by our 
threat model, we propose the design shown in Figure 1. 
As illustrated in the figure, a Telex connection proceeds 
as follows:

1. The user’s client selects an appropriate website that
is not on the censor’s blacklist and unlikely to attract
attention, which we represent by the domain
NotBlocked.com.

2. The user connects to NotBlocked.com via HTTPS. 
Her Telex client¹ includes an invisible “tag,” which
looks like an expected random nonce to the censor, 
but can be cryptographically verified by the Telex 
station using its private key. 

3. Somewhere along the route between the client and 
NotBlocked.com, the connection traverses an ISP 
that has agreed to attach a Telex station to one of its 
routers. The connection is forwarded to the station 
via a dedicated tap interface. 

4. The station detects the tag and instructs the router to 
block the connection from passing through it, while 
still forwarding packets to the station through its 
dedicated tap. (Unlike a deployment based on trans- 
parent proxying, this configuration fails open: it 
tolerates the failure of the entire Telex system and so 
meets our goal of being easy to deploy.) 

5. The Telex station diverts the flow to Blocked.com as
the user requested; it continues to actively forward
packets from the client to Blocked.com and vice
versa until one side terminates the connection. If the
connection were untagged, it would pass through the
ISP’s router as normal.


¹ We anticipate that client software will be distributed out of band,
perhaps by sneakernet, among mutually trusting individuals within the
censor’s domain.

The discussion above glosses over one important
point: we need to specify what protocol is to be used over
the encrypted tunnel between the Telex client and the 
Telex station and how the client communicates its choice 
of Blocked.com. Layering IP atop the tunnel might seem 
to be a natural choice, yielding a country-wide VPN of 
sorts, but even a passive attacker may be able to differen- 
tiate VPN traffic patterns from those of a normal HTTPS 
connection. As a result, we primarily envision using Telex 
for protocols whose session behavior resembles that of 
HTTPS. For example, an HTTP or SOCKS proxy would 
be a useful application, or perhaps even a simple server 
that presented a list of entry points for another anticen- 
sorship system such as Tor [10]. In the remainder of this 
paper, we assume that the application is an HTTP proxy. 

The precise placement of Telex stations is a second 
issue. Clearly, a chief objective of deployment is to cover 
as many paths between the censor and popular Internet 
destinations as possible so as to provide a large selection 
of sites to play the role of NotBlocked.com. We might ac- 
complish this either by surrounding the censor with Telex 
stations or by placing them close to clusters of popular 
uncensored destinations. In the latter case, care should 
be taken not to reduce the size of the cluster such that the 
censor would only need to block a small number of other- 
wise desirable sites to render the station useless. Which 
precise method of deployment would be most effective 
and efficient is, in part, a geopolitical question. 

A problem faced by existing anticensorship systems 
is providing sufficient incentives for deployment [6]. 
Whereas systems that require cooperation of uncensored 
websites create a risk that such sites might be blocked 
by censors in retaliation, our system requires no such 
participation. We envision that ISPs will willingly deploy 
Telex stations for a number of reasons, including idealism, 
goodwill, public relations, or financial incentives (e.g., 
tax credits) provided by governments. At worst, the con- 
sequences to ISPs for participation would be depeering, 
but depeering a large ISP would have a greater impact 
on overall network performance than blocking a single 
website. 

Discovery of Telex stations is a third issue. With wide 
enough deployment, clients could pick HTTPS servers 
at random. However, this behavior might divulge clients’ 
usage of Telex, because real users don’t actually visit 
HTTPS sites randomly. A better approach would be to 
opportunistically discover Telex stations by tagging flows 
during the course of the user’s normal browsing. When a 
station is eventually discovered, it could provide a more
comprehensive map of popular sites (where popularity is
measured with data from other Telex users) such that a
Telex station is likely to be on the path between the user 
and the site. Even with only partial deployment, users 
would almost certainly discover a Telex station eventually. 


3 Previous Work 


There is a rich literature on anonymous and censorship- 
resistant communication, going back three decades [7]. 
One of the first systems explicitly proposed for combating 
wide-scale censorship was Infranet [13], where participat- 
ing websites would discreetly provide censored content in 
response to steganographic requests. Infranet’s designers 
dismissed the use of TLS because, at the time, it was not 
widely deployed and would be easily blocked. We observe 
that this aspect of Internet use has substantially changed 
since 2002. Unlike Infranet, Telex does not require the 
cooperation of unblocked websites—a significant imped- 
iment to deployment—which participate in our system 
only as oblivious cover destinations. 

A variety of systems provide low-latency censorship 
resistance through VPNs or encrypted tunnels to proxies. 
These systems rely on servers at the edge of the network, 
which censors constantly try to find and block (via IP). By 
far, the best studied of these systems is Tor [10], which 
also attempts to make strong anonymity guarantees by 
establishing a multi-hop encrypted tunnel. Traditionally, 
users connect to Tor via a limited set of “entry nodes,” 
which provide an obvious target for censors. In response, 
Tor has implemented bridges [27], which are a variation 
on Feamster et al.’s keyspace hopping [14], in which each 
client is told only a small subset of addresses of available 
proxies. While bridges provide an extra layer of protec- 
tion, the arms race remains: Chinese censors now learn 
and block a large fraction of bridge nodes [9], possibly by 
using a Sybil attack [11] against the bridge address distri- 
bution system. Like Telex, Tor adopts a pragmatic threat 
model that emphasizes performance; it wraps connections 
using TLS and does not strongly protect against traffic 
analysis and end-to-end timing attacks [22]. Unlike Tor, 
we separate the problem of censorship resistance from 
that of anonymous communication and concentrate on re- 
sisting blocking; users who require increased anonymity 
can use Telex as a gateway to the Tor network. 

The most widely-used anticensorship tools today are 
also among those that make the fewest security promises. 
Pragmatic systems such as Dynaweb [12] and Ultra- 
surf [30] that employ simple encrypted tunnels with large 
numbers of entry points are popular, and, so far, have man- 
aged to stay one step ahead of many censors. However, 
we worry that such systems will not be able to withstand
continued research and development on the part of censors
(e.g., Sybil attacks for proxy IP discovery). We aim
to provide similar or better performance by adopting a
single-hop tunnel and locating proxies in the middle of
the network, where they are not susceptible to IP-based
blocking.


Figure 2: Tag creation and detection. Telex intercepts TLS connections that contain a steganographic tag in the
ClientHello message’s nonce field (normally a uniformly random string). The Telex client generates the tag using public
parameters (shown above), but it can only be recognized by using the private key r embedded in the Telex station.


4 Tagging 


In this section, we describe how we implement the invis- 
ible tag for TLS connections, which only Telex stations 
can recognize. We present an overview here, while the 
details and a security argument appear in Appendix A. 
Figure 2 depicts the tagging scheme. 

Our tags must have two properties: they must be short, 
and they must be indistinguishable from a uniformly ran- 
dom string to anyone without the private key. Someone 
with the private key should be able to examine a random- 
looking value and efficiently decide whether the tag is 
present; if so, a shared secret key is derived for use later 
in the protocol. 

The structure of the Telex tagging system is based on
Diffie-Hellman: there is a generator $g$ of a group of prime
order. Telex has a private key $r$ and publishes a public
key $\alpha = g^r$. The system uses two cryptographically
secure hash functions $H_1$ and $H_2$, each salted by the current
context string $\chi$ (see Section 5). To construct a tag,
the client picks a random private key $s$, and computes
$g^s$ and $\alpha^s = g^{rs}$. If $\|$ denotes concatenation, the tag is
then $g^s \| H_1(g^{rs} \| \chi)$, and the derived shared secret key is
$H_2(g^{rs} \| \chi)$.
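The tag construction can be illustrated with a small sketch. Everything concrete here is an assumption made for readability: a toy prime-order subgroup of the integers modulo a small safe prime stands in for the elliptic curve group, SHA-256 with a one-byte index stands in for the two hash functions, and the byte encodings are arbitrary. The parameters are far too small to be secure.

```python
import hashlib
import secrets

# Toy parameters (assumptions): the order-Q subgroup of the integers
# modulo the safe prime P = 2Q + 1, generated by G.
P, Q, G = 2039, 1019, 4


def H(index: int, element: int, context: bytes) -> bytes:
    """Hash number `index` (1 or 2), salted with the context string."""
    data = bytes([index]) + element.to_bytes(2, "big") + context
    return hashlib.sha256(data).digest()


def make_tag(alpha: int, context: bytes):
    """Client side: build (tag, shared_key) for the station public key alpha = g^r."""
    s = secrets.randbelow(Q - 1) + 1      # random private key s in [1, Q-1]
    g_s = pow(G, s, P)                    # g^s, transmitted as part of the tag
    g_rs = pow(alpha, s, P)               # alpha^s = g^(rs), never transmitted
    tag = g_s.to_bytes(2, "big") + H(1, g_rs, context)
    shared_key = H(2, g_rs, context)
    return tag, shared_key
```

A holder of the private key r can recompute g^(rs) from the first part of the tag as (g^s)^r, which is what makes the tag recognizable to the station and to nobody else.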

Diffie-Hellman can be implemented in many different 
groups, but in order to keep the tags both short and secure, 
we must use elliptic curve groups. Then we must ensure
that, in whatever bit representation we use to transmit
group elements $g^s$, they are indistinguishable from uniformly
random strings of the same size. This turns out to
be quite tricky, for three reasons:


• First, it is easy to tell whether a given (x,y) is a point
on a (public) elliptic curve. Most random strings will
not appear to be such a point. To work around this,
we only transmit the x-coordinates of the elliptic
curve points.


• Second, it is the case that these x-coordinates are
taken modulo a prime p. Valid tags will never contain
an x-coordinate larger than p, so we must ensure
that random strings of the same length as p are extremely
unlikely to represent a value larger than p.
To accomplish this, we select a value of p that is
only slightly less than a power of 2.


• Finally, it turns out that for any given elliptic
curve, only about half of the numbers mod p are
x-coordinates of points on the curve. This is undesirable,
as no purported tag with an x-coordinate not
corresponding to a curve point can possibly be valid.
(Conversely, if a given client is observed using only
x-coordinates corresponding to curve points, it is
very likely using Telex.) To solve this, we use two
elliptic curves: the original curve and a related one
called the “twist”. These curves have the property
that every number mod p is the x-coordinate of a
point on either the original curve or the twist. We
will now need two generators: $g_0$ for the original
curve, and $g_1$ for the twist, along with the corresponding
public keys $\alpha_0 = g_0^r$ and $\alpha_1 = g_1^r$. Clients
pick one pair $(g_b, \alpha_b)$ uniformly at random when
constructing tags.
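The second and third points can be checked numerically on a toy example. The curve below (y^2 = x^3 + 2x + 3 over the integers mod 1009) is a hypothetical choice for illustration and has nothing to do with the parameters Telex uses; the helper names are likewise assumptions.

```python
from fractions import Fraction


def is_square(value: int, p: int) -> bool:
    """Euler's criterion for an odd prime p; 0 counts as a square."""
    value %= p
    return value == 0 or pow(value, (p - 1) // 2, p) == 1


def classify_x(p: int, a: int, b: int):
    """Split residues mod p by whether x lies on y^2 = x^3 + ax + b or on its twist."""
    on_curve, on_twist = set(), set()
    for x in range(p):
        rhs = (x * x * x + a * x + b) % p
        if is_square(rhs, p):
            on_curve.add(x)   # some y satisfies y^2 = rhs
        if rhs == 0 or not is_square(rhs, p):
            on_twist.add(x)   # some y satisfies d*y^2 = rhs for a non-square d
    return on_curve, on_twist


def out_of_range_fraction(p: int) -> Fraction:
    """Fraction of bit strings of p's length that encode a value >= p."""
    k = p.bit_length()
    return Fraction(2**k - p, 2**k)


P_TOY, A, B = 1009, 2, 3  # hypothetical toy curve
curve_x, twist_x = classify_x(P_TOY, A, B)
print(len(curve_x), len(twist_x), len(curve_x | twist_x))
print(out_of_range_fraction(P_TOY))
```

Roughly half of the residues land on the curve, and the union with the twist covers every residue, so a client that picks the curve or the twist at random can emit any x value. The last function shows why p should sit just below a power of 2: the closer it is, the smaller the fraction of random strings that could never be a valid x-coordinate.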

Client                                       Server

ClientHello              -------->
                                             ServerHello
                                             Certificate
                                             ServerKeyExchange
                         <--------           ServerHelloDone
ClientKeyExchange
ChangeCipherSpec
[Finished]               -------->
                                             ChangeCipherSpec
                         <--------           [Finished]


Figure 3: TLS Handshake. The client and server exchange
messages to establish a shared master_secret, from
which they derive cipher and MAC keys. The handshake
ends with each side sending a Finished message, encrypted
with the negotiated keys, that includes an integrity
check on the entire handshake. The ServerKeyExchange
message may be omitted, depending on the key exchange
method in use.


When Telex receives a candidate tag, it divides it into
two parts as $\beta \| h$, according to the fixed lengths of group
elements and hashes. It also determines the current context
string $\chi$. If this is a valid tag, $\beta$ will be $g_b^s$ and $h$
will be $H_1(g_b^{rs} \| \chi)$ for some $s$ and $b$. If this is not a valid
tag, $\beta$ and $h$ will both be random. Thus, Telex simply
checks whether $h = H_1(\beta^r \| \chi)$. This will always be true
for valid tags, and will be true only with probability $2^{-\ell_{H_1}}$
for invalid tags, where $\ell_{H_1}$ is the bit length of the outputs
of $H_1$. If it is true, Telex computes the shared secret key
as $H_2(\beta^r \| \chi)$.
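The detection check can be sketched on the station side as follows. As before, this is only an illustration under stated assumptions: a toy multiplicative group replaces the elliptic curve (so there is a single generator and no twist), SHA-256 with a one-byte index plays the role of both hash functions, and the two-byte element encoding is arbitrary.

```python
import hashlib
import hmac

# Toy group (assumption): order-Q subgroup modulo the safe prime P = 2Q + 1.
P, Q, G = 2039, 1019, 4
ELEMENT_LEN = 2  # bytes used to encode a group element in this toy setting


def H(index: int, element: int, context: bytes) -> bytes:
    """Hash number `index` (1 or 2), salted with the context string."""
    data = bytes([index]) + element.to_bytes(2, "big") + context
    return hashlib.sha256(data).digest()


def check_tag(candidate: bytes, r: int, context: bytes):
    """Station side: return the shared key if `candidate` is a valid tag, else None."""
    beta = int.from_bytes(candidate[:ELEMENT_LEN], "big")  # claimed g^s
    h = candidate[ELEMENT_LEN:]                            # claimed hash part
    beta_r = pow(beta, r, P)                               # equals g^(rs) for valid tags
    if not hmac.compare_digest(h, H(1, beta_r, context)):
        return None                                        # not a tag: keep hands off
    return H(2, beta_r, context)
```

The station performs one exponentiation and one hash per candidate, and a random nonce passes the check only if its trailing bytes happen to equal the hash output, which matches the false-positive probability stated above.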


5 Protocol 


In this section, we briefly describe the Transport Layer 
Security (TLS) protocol [8] and then we explain our mod- 
ifications to it. 


5.1 Overview of TLS 


TLS provides a secure channel between a client and a 
server, and consists of two sub-protocols: the handshake 
protocol and the record protocol. The handshake protocol 
provides a mechanism for establishing a secure channel 
and its parameters, including shared secret generation 
and authentication. The record protocol provides a se- 
cure channel based on parameters established from the 
handshake protocol. 

During the TLS handshake, the client and server agree 
on a cipher suite they will use to communicate, the server 
authenticates itself to the client using asymmetric certifi- 
cates (such as RSA), and cryptographic parameters are 
shared between the server and client by means of a key 
exchange algorithm. While TLS supports several key 
exchange algorithms, in this paper, we will focus on the 
Diffie-Hellman key exchange.

Figure 3 provides an outline of the TLS handshake. We
describe each of these messages in detail below:

ClientHello contains a 32-byte nonce, a session identifier
(0 if a session is not being resumed), and a list of supported
cipher suites. The nonce consists of a 4-byte Unix
timestamp, followed by a 28-byte random value.


ServerHello contains a 32-byte nonce formed identically 
to that in the ClientHello as well as the server’s choice of 
one of the client’s listed cipher suites. 


Certificate contains the X.509 certificate chain of the 
server, and authenticates the server to the client. 


ServerKeyExchange provides the parameters for the
Diffie-Hellman key exchange. These parameters include
a generator $g$, a large prime modulus $p_{DH}$, a server public
key, and a signature. As per the Diffie-Hellman key
exchange, the server public key is generated by computing
$g^{s_{priv}} \bmod p_{DH}$, where $s_{priv}$ is a large random number
generated by the server. The signature consists of the
RSA signature (using the server’s certificate private key)
over the MD5 and SHA-1 hashes of the client and server
nonces, and previous Diffie-Hellman parameters.


ServerHelloDone is an empty record, used to update the 
TLS state on the receiving (i.e., client) end. 


ClientKeyExchange contains the client’s Diffie-Hellman
parameter (the client public key generated by $g^{c_{priv}} \bmod p_{DH}$).

ChangeCipherSpec alerts the server that the client’s 
records will now be encrypted using the agreed upon 
shared secret. The client finishes its half of the handshake 
protocol with an encrypted Finished message, which veri- 
fies the cipher spec change worked by encrypting a hash 
of all previous handshake messages. 
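The layout of the ClientHello nonce described above is simple enough to show directly. The sketch below packs and unpacks the 32-byte value; the function names and the optional `payload` argument are assumptions introduced for illustration.

```python
import os
import struct
import time


def client_random(now=None, payload=None) -> bytes:
    """Build the 32-byte nonce: a 4-byte big-endian Unix timestamp, then 28 bytes."""
    timestamp = int(time.time()) if now is None else now
    body = os.urandom(28) if payload is None else payload
    if len(body) != 28:
        raise ValueError("the non-timestamp part of the nonce must be 28 bytes")
    return struct.pack(">I", timestamp) + body


def split_client_random(nonce: bytes):
    """Return (timestamp, 28-byte value) from a 32-byte nonce."""
    if len(nonce) != 32:
        raise ValueError("a ClientHello nonce is exactly 32 bytes")
    (timestamp,) = struct.unpack(">I", nonce[:4])
    return timestamp, nonce[4:]
```

The 28-byte portion is 224 bits long, so any value that is indistinguishable from random and fits in 224 bits can occupy it without changing the shape of the message.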


5.2 Telex handshake


The Telex handshake has two main goals: first, the censor 
should not be able to distinguish it from a normal TLS 
handshake; second, it should position the Telex station as 
a man-in-the-middle on the secure channel. We now de- 
scribe how the Telex handshake deviates from a standard 
TLS handshake. 


Client setup The client selects an uncensored HTTPS 
server located outside the censor’s network (canonically, 
https://NotBlocked.com) and resolves its hostname to find 
server_ip. This server may be completely oblivious to 
the anticensorship system. The client refers to its database 
of Telex stations’ public keys to select the appropriate key 
$P = (\alpha_0, \alpha_1)$ for this session. We leave the details of
selecting the server and public key for future work. 


ClientHello message The client generates a fresh
tag $\tau$ by applying the algorithm specified in Section 4,
using public key $P$ and a context string composed
of server_ip||UNIX_timestamp||TLS_session_id.
This yields a 224-bit tag $\tau$ and a 128-bit shared secret key
$k_{sh}$. The client initiates a TCP connection to server_ip
and starts the TLS handshake. As in normal TLS, the
client sends a ClientHello message, but, in place of the
224-bit random value, it sends $\tau$.

(Briefly, the tag construction ensures that the Telex
station can use its private key to efficiently recognize $\tau$
as a valid tag and derive the shared secret key $k_{sh}$, and
that, without the private key, the distribution of $\tau$ values
is indistinguishable from uniform; see Section 4.)

If the path from the client to server_ip passes through 
a link that a Telex station is monitoring, the station ob- 
serves the TCP handshake and ClientHello message. It 
extracts the nonce and applies the tag detection algorithm 
specified in Section 4 using the same context string and 
its private key. If the nonce is a genuine tag created with 
the correct key and context string, the Telex station learns 
$k_{sh}$ and continues to monitor the handshake. Otherwise,
with overwhelming probability, it rejects the tag and stops 
observing the connection. 
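Both the client and the station must derive the same context string from fields that are visible on the wire. One possible byte encoding is sketched below; the field widths, the ordering of bytes within each field, and the function name are assumptions, since the text only specifies which fields are concatenated.

```python
import ipaddress
import struct


def context_string(server_ip: str, unix_timestamp: int, session_id: bytes) -> bytes:
    """Concatenate server_ip || UNIX_timestamp || TLS_session_id as bytes."""
    ip_bytes = ipaddress.ip_address(server_ip).packed   # 4 bytes for IPv4
    ts_bytes = struct.pack(">I", unix_timestamp)        # same 4 bytes as in the nonce
    return ip_bytes + ts_bytes + session_id
```

Because the timestamp and session identifier travel in the ClientHello and the server address is in the IP header, a station observing the flow can rebuild the identical string, while a tag replayed toward a different server or at a different time would be checked against a different context.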


Certificate validation The server responds by send- 
ing its X.509 certificate and, if necessary, key exchange 
values. The client verifies the certificate using the CA 
certificates trusted by the user’s browser. It addition- 
ally checks the CA at the root of the certificate chain 
against a whitelist of CAs trusted by the anticensorship 
service. If the certificate is invalid or the root CA is not on 
the whitelist, the client proceeds with the handshake but 
aborts its Telex invocation by strictly following the TLS 
specification and sending an innocuous application-layer 
request (e.g., GET / HTTP/1.1 for HTTPS).²
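The client's decision at this step reduces to two independent checks. A minimal sketch follows; the type `CertificateCheck`, its fields, and the action labels are hypothetical names used only to make the rule concrete, and real certificate validation is assumed to have been done elsewhere.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CertificateCheck:
    browser_trusts_chain: bool   # outcome of ordinary X.509 chain validation
    root_ca: str                 # identifier of the root CA of the chain


def next_action(check: CertificateCheck, telex_root_whitelist: frozenset) -> str:
    """Decide what the client does after receiving the server certificate."""
    if check.browser_trusts_chain and check.root_ca in telex_root_whitelist:
        return "continue-telex"
    # Either check failed: finish the handshake as a normal TLS client
    # and send an innocuous request instead of invoking Telex.
    return "fall-back-to-normal-request"
```

Requiring both conditions means that a certificate accepted by the browser but issued under a root outside the whitelist still causes the client to fall back.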


Key exchange At this point in the handshake, the client 
participates in the key exchange to compute a master se- 
cret shared with the server. We modify the key exchange 
in order to “leak” the negotiated key to the Telex station. 
Several key exchange algorithms are available. For example,
in RSA key exchange, the client generates a 48-byte
premaster secret (a 2-byte protocol version followed by 46
random bytes) and encrypts it using the server’s public
key. Alternatively, the client and server can participate
in a Diffie-Hellman key exchange to derive the master 
secret. 

The Telex client, rather than generating its key exchange
values at random, seeds a secure PRG with k_sh
and uses its output for whatever randomness is required
in the key exchange algorithm (e.g., the Diffie-Hellman
exponent). If a Telex station has been monitoring the
connection to this point, it will know all the inputs to the
client’s key exchange procedure: it will have observed
the server’s key exchange parameter and computed the
client’s PRG seed k_sh. Using this information, the Telex
station simulates the client and simultaneously derives the
same master secret.

² Both the additional root CA whitelist and the browser list need to be
checked; the censor may control a CA that is commonly whitelisted by
browsers, and the root CA whitelist may contain entries that are trusted
by one browser but not another.
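
The key exchange modification can be illustrated with a small sketch. It assumes a PRG built from HMAC-SHA256 in counter mode and a toy Diffie-Hellman group; both are illustrative choices made here, not the construction or parameters used by the prototype, and all function names are hypothetical.

```python
import hashlib
import hmac

# Toy Diffie-Hellman group for illustration only (far too small for real use).
P = 2**127 - 1  # a Mersenne prime
G = 3           # assumed public base


def prg(seed: bytes, nbytes: int) -> bytes:
    """Expand seed into nbytes pseudorandom bytes (HMAC-SHA256, counter mode)."""
    out = b""
    counter = 0
    while len(out) < nbytes:
        out += hmac.new(seed, counter.to_bytes(4, "big"), hashlib.sha256).digest()
        counter += 1
    return out[:nbytes]


def client_exponent(k_sh: bytes) -> int:
    """Derive the client's DH exponent from k_sh instead of fresh randomness."""
    return 1 + int.from_bytes(prg(k_sh, 16), "big") % (P - 2)


def dh_public(exponent: int) -> int:
    return pow(G, exponent, P)


def dh_shared(exponent: int, peer_public: int) -> int:
    return pow(peer_public, exponent, P)


# The server behaves normally and picks its own exponent.
server_exp = 0x1234567890ABCDEF
server_pub = dh_public(server_exp)

k_sh = b"shared secret recovered from the tag"
client_exp = client_exponent(k_sh)    # computed by the Telex client
station_exp = client_exponent(k_sh)   # recomputed by the Telex station

client_view = dh_shared(client_exp, server_pub)
station_view = dh_shared(station_exp, server_pub)  # station only observed server_pub
server_view = dh_shared(server_exp, dh_public(client_exp))
```

All three parties end up with the same value: the station never sees the client's exponent on the wire, but it can regenerate it because the exponent is a deterministic function of k_sh.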


Handshake completion If a Telex station is listening,
it attempts to decrypt each side’s Finished message. The 
station should be able to use the master secret to decrypt 
them correctly and verify that the hashes match its obser- 
vations of the handshake. If either hash is incorrect, the 
Telex station stops observing the connection. Otherwise, 
it switches roles from a passive observer to a man-in-the- 
middle. It forges a TCP RST packet from the client to 
NotBlocked.com, blocks subsequent messages from ei- 
ther side from reaching the remote end of the connection, 
and assumes the server’s role in the unbroken TCP/TLS 
connection with the client. 


Session resumption Once a client and server have es- 
tablished a session, TLS allows them to quickly resume 
or duplicate the connection using an abbreviated hand- 
shake. Our protocol can support this too, allowing the 
Telex station to continue its role as a man-in-the-middle. 

The station remembers k_sh and the session_id chosen by the
server for each session it successfully joined. A client
attempts to resume the session on a new connection by sending
a ClientHello message containing the session_id
and a fresh tag t’, which Telex can observe and verify if 
it is present. If the server agrees to resume the session, 
it responds with a ServerHello message and a Finished 
message encrypted with the original master secret. The 
client then sends its own Finished message encrypted in 
the same way, which confirms that it knows the original 
master secret. The Telex station checks that it can decrypt 
and verify these messages correctly, then switches into a 
man-in-the-middle role again. 


6 Security Analysis 


In this section, we analyze Telex’s security under the 
threat model described in Section 2.1. 


6.1 Passive attacks 


First, we consider a passive censor who is able to ob- 
serve arbitrary traffic within its network. For this censor 
to detect that a client is using Telex, it must be able to 
distinguish normal TLS flows from Telex flows. 

Telex deviates from a normal TLS handshake in the 
client’s nonce (sent in the ClientHello message) and in 
the client’s key exchange parameters. In Section 4, we 
showed that an attacker cannot distinguish a Telex tag 
from a truly random string with more than a negligible 
advantage. This means that a client’s tagged nonce (using 
Telex) is indistinguishable from a normal TLS random 
nonce. Likewise, the Telex-generated key exchange pa- 
rameters are the output of a secure PRG; they are not 
distinguishable from truly random strings as a direct result
of the security of the PRG.

During the TLS record protocol, symmetric cryptogra- 
phy is used between the Telex station and the client. A 
censor will be unable to determine the contents of this 
encrypted channel, as in normal TLS, and will thus be un- 
able to distinguish between a Telex session and a normal 
TLS session from the cryptographic payload alone. 


Stream cipher weakness TLS supports several stream 
cipher modes for encrypting data sent over the connec- 
tion. Normally, the key stream is used once per session, to 
avoid vulnerability to a reused key attack. However, the 
Telex station and NotBlocked.com use the same shared 
secret when sending data to the client, so the same key 
stream is used to encrypt two different plaintexts. An 
attacker (possibly different from the censor) with the abil- 
ity to receive both of the resulting ciphertexts can simply 
XOR them together to obtain the equivalent of the plain- 
texts XORed together. To mitigate this issue, Telex sends 
a TCP RST to NotBlocked.com to quickly stop it from 
returning data. In addition, our implementation uses a 
block cipher in CBC mode, for which TLS helps mitigate 
these issues further by providing for the communication 
of a random per-record IV. 
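
The keystream-reuse problem can be demonstrated in a few lines. The sketch below uses a random byte string as a stand-in for a stream cipher's key stream; the plaintexts and function names are hypothetical.

```python
import os


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two byte strings (truncating to the shorter one)."""
    return bytes(x ^ y for x, y in zip(a, b))


def leak_from_reuse(p1: bytes, p2: bytes, keystream: bytes) -> bytes:
    """Encrypt two plaintexts under the same key stream and XOR the ciphertexts."""
    c1 = xor_bytes(p1, keystream)
    c2 = xor_bytes(p2, keystream)
    # The key stream cancels out: c1 XOR c2 == p1 XOR p2.
    return xor_bytes(c1, c2)


keystream = os.urandom(32)             # stand-in for the shared key stream
from_server = b"reply from NotBlocked.com"
from_station = b"reply from Telex station!"
leaked = leak_from_reuse(from_server, from_station, keystream)
```

An observer holding both ciphertexts obtains the XOR of the plaintexts without knowing the key, which is why the station tries to silence NotBlocked.com quickly.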

We note that an adversary in position to carry out this 
attack (such as one surrounding the Telex station) already 
has the ability to detect the client’s usage of Telex, as 
well as the contents of the connection from Telex to 
Blocked.com. 


Traffic analysis A sophisticated adversary might at- 
tempt to detect a use of Telex by detecting anomalous 
patterns in connection count, packet size, and timing. Pre- 
vious work shows how these characteristics can be used to 
fingerprint and identify specific websites being retrieved 
over TLS [18]. However, this kind of attack would be 
well beyond the level of sophistication observed in current 
censors [16]. We outline a possible defense against traffic 
analysis in Section 9. 


6.2 Active attacks 


Our threat model also allows the censor to attempt a vari- 
ety of active attacks against Telex. The system provides 
strong defenses against the most practical of these attacks. 


Traffic manipulation The censor might attempt to 
modify messages between the client and the Telex sta- 
tion, but Telex inherits defenses against this from TLS. 
For example, if the attacker modifies any of the param- 
eters in the handshake messages, the client and Telex 
station will each detect this when they check the MACs in 
the Finished messages, which are protected by the shared 
secret of the TLS connection. Telex will then not intercept 
the connection, and the NotBlocked.com server will re- 
spond with a TLS error. Widescale manipulation of TLS 
handshakes or payloads would disrupt Telex; however, it
would also interfere with the normal operation of TLS 
websites. 


Tag replay The censor might attempt to use various 
replay attacks to detect Telex usage. The most basic of 
these attacks is for the censor to initiate its own Telex 
connection and reuse the nonce from a suspect connec- 
tion; if this connection receives Telex service, the censor 
can conclude that the nonce was tagged and the original 
connection was a Telex request. 

Our protocol prevents this by requiring the client to 
prove to the Telex station that it knows the shared secret 
associated with the tagged nonce. We achieve this by 
using the shared secret to derive the key exchange parameter,
as described in Section 5. In particular, consider
the encrypted Finished message that terminates the TLS
handshake. First, this message must be encrypted using the
freshly negotiated key (or else the TLS server will hang
up), so it cannot simply be replayed. Second, the key
exchange parameter in use must match the shared secret 
in the tagged nonce, or the Telex station will not be able 
to verify the MAC on the Finished message. Together, 
these requirements imply that the client must know the 
shared secret. 


Handshake replay This property of proving knowledge
of the shared secret is only valid if the server provides
fresh key exchange parameters. An attacker may
circumvent this protection by replaying traffic in both di- 
rections across the Telex station. This attack will cause a 
visible difference in the first ApplicationData message re- 
ceived at the client, provided that either 1) Blocked.com’s 
response is not completely static (e.g., it sets a session 
cookie) or 2) the original connection being replayed was 
an unsuccessful Telex connection. In either case, the 
new ApplicationData message will be fresh data from 
Blocked.com. 

A partial defense against this attack is to enforce fresh- 
ness of the timestamps used in both halves of the TLS 
handshake and prohibit nonce reuse within the window 
of acceptable timestamps. However, this defense fails 
in the case where the original connection being replayed 
was an unsuccessful attempt to initiate a Telex connec- 
tion, because the Telex station did not see the first use 
of the nonce. As a further defense, we note that Not- 
Blocked.com will likely not accept replayed packets, and 
the Telex station can implement measures to detect at- 
tempts to prevent replayed packets from reaching Not- 
Blocked.com. 
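
The partial defense (timestamp freshness plus a ban on nonce reuse inside the acceptance window) might look like the following sketch. The class name, the window length, and the use of plain integer timestamps are assumptions made for illustration.

```python
from typing import Dict


class NonceReplayFilter:
    """Reject stale timestamps and nonces already seen within the window."""

    def __init__(self, window_seconds: int = 300):  # window size is an assumption
        self.window = window_seconds
        self.seen: Dict[bytes, int] = {}  # nonce -> timestamp it carried

    def accept(self, nonce: bytes, timestamp: int, now: int) -> bool:
        # Forget nonces whose timestamps can no longer pass the freshness test.
        self.seen = {n: t for n, t in self.seen.items()
                     if now - t <= self.window}
        if abs(now - timestamp) > self.window:
            return False          # stale (or far-future) timestamp
        if nonce in self.seen:
            return False          # nonce reused inside the window
        self.seen[nonce] = timestamp
        return True


replay_filter = NonceReplayFilter(window_seconds=60)
first = replay_filter.accept(b"nonce-1", timestamp=1000, now=1000)
replayed = replay_filter.accept(b"nonce-1", timestamp=1000, now=1010)
stale = replay_filter.accept(b"nonce-2", timestamp=900, now=1000)
```

As the text notes, such a filter only helps for nonces the station actually observed; it cannot catch the replay of a connection that never passed the station.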


Ciphertext comparison The attacker is able to detect 
the use of Telex if they are able to receive the unaltered 
traffic from NotBlocked.com, in addition to the traffic 
they forward to the client. Though they will not be able 
to decrypt either of the messages, they will be able to see 
that the ciphertexts differ, and from this conclude that a
client is using Telex. Normally, Telex blocks the traffic 
between NotBlocked.com and the client after the TLS 
handshake to prevent this type of attack. 

However, it is possible for an attacker to use DNS hi- 
jacking for this purpose. The attacker hijacks the DNS en- 
try for NotBlocked.com to point to an attacker-controlled 
host. The client’s path to this host passes through Telex, 
and the attacker simply forwards traffic from this host to 
NotBlocked.com. Thus, the attacker is able to observe the 
ciphertext traffic on both sides of the Telex station, and 
therefore able to determine when it modifies the traffic. 

Should censors actually implement this attack, we can 
modify Telex stations in the following way to help detect 
DNS hijacking until DNSSEC is widely adopted. When 
it observes a tagged connection to a particular server IP, 
the station performs a DNS lookup based on the common 
name observed in the X.509 certificate. This DNS lookup 
returns a list of IP addresses. If the server IP for the 
tagged connection appears in this list, the Telex station 
will respond to the client and proxy the connection. Oth- 
erwise, the station will not deviate from the TLS protocol, 
as it is possible that the censor is hijacking DNS. This 
may lead to false negatives, as DNS is not globally con- 
sistent for many sites, but as long as the censor has not 
compromised the DNS chain that the station uses, there 
will be no false positives. For popular sites, we could also 
add a whitelisted cache of IP addresses. 
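
A sketch of this consistency check follows. It assumes the station uses the system resolver through Python's `socket.getaddrinfo`; the function names, the injectable `resolver` parameter, and the optional cache structure are hypothetical conveniences.

```python
import socket
from typing import Callable, Dict, Optional, Set


def resolve_ipv4(hostname: str) -> Set[str]:
    """Resolve a hostname to its IPv4 addresses; empty set on failure."""
    try:
        infos = socket.getaddrinfo(hostname, 443, socket.AF_INET,
                                   socket.SOCK_STREAM)
    except socket.gaierror:
        return set()
    return {info[4][0] for info in infos}


def dns_consistent(server_ip: str, common_name: str,
                   resolver: Callable[[str], Set[str]] = resolve_ipv4,
                   cache: Optional[Dict[str, Set[str]]] = None) -> bool:
    """Return True if server_ip is a plausible address for common_name."""
    if cache is not None and server_ip in cache.get(common_name, set()):
        return True                       # whitelisted cache hit
    # Otherwise fall back to the station's own DNS lookup.
    return server_ip in resolver(common_name)


# Offline demonstration with a stub resolver (documentation-range addresses).
stub = lambda name: {"192.0.2.10", "192.0.2.11"}
ok = dns_consistent("192.0.2.10", "notblocked.example", resolver=stub)
hijacked = dns_consistent("203.0.113.5", "notblocked.example", resolver=stub)
```

When the check fails the station simply stays passive, so a mismatch caused by inconsistent DNS yields a false negative rather than exposing the client.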

Since the censor controls part of the network between 
the client and the Telex station, it could also try to redirect 
the connection by other means, such as transparently prox- 
ying the connection to a censor-controlled host. In these 
cases, the destination IP address observed by Telex will 
be different from the one specified by the client. Thus, 
the context strings constructed by the client and Telex 
will differ, and Telex will not recognize the connection 
as tagged. This attack offers the adversary an expensive 
denial of service attack, but it does not allow the attacker 
to detect attempted use of Telex. 


Denial of service A censor may attempt to deny service 
from Telex in two ways. First, it may attempt to exhaust 
Telex’s bandwidth to proxy to Blocked.com. Second, it 
may attempt to exhaust a Telex station’s tag detection 
capabilities by creating a large number of ClientHello
messages for the station to check. Both methods are overt
attacks that may cause unwanted political backlash against the
censor or even provoke an international incident. To com- 
bat the first attack, we can implement a client puzzle [20], 
where Telex issues a computationally intensive puzzle 
the client must solve before we allow proxy service. The 
client puzzle should be outsourced [32] to avoid addi- 
tional latency that might distinguish Telex handshakes 
from normal TLS handshakes. To combat the second 
attack, we can implement our tag checking in hardware 
to increase throughput if necessary.
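
To give a sense of how a client puzzle works, here is a generic hash-based sketch: the client must find a value whose SHA-256 digest, together with the challenge, starts with a given number of zero bits. This is a common textbook form and is not necessarily the construction of [20]; it also omits the outsourcing described in [32]. All names and the difficulty setting are assumptions.

```python
import hashlib
import itertools


def leading_zero_bits(digest: bytes) -> int:
    """Count the leading zero bits of a digest."""
    return len(digest) * 8 - int.from_bytes(digest, "big").bit_length()


def verify(challenge: bytes, solution: int, difficulty: int) -> bool:
    """Cheap check performed by the party that issued the puzzle."""
    digest = hashlib.sha256(challenge + solution.to_bytes(8, "big")).digest()
    return leading_zero_bits(digest) >= difficulty


def solve(challenge: bytes, difficulty: int) -> int:
    """Brute-force search performed by the client (about 2**difficulty hashes)."""
    for candidate in itertools.count():
        if verify(challenge, candidate, difficulty):
            return candidate


challenge = b"example-challenge"
solution = solve(challenge, difficulty=12)
```

Verification costs one hash while solving costs thousands, which is the asymmetry that makes bandwidth-exhaustion attempts more expensive for the attacker.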


7 Implementation 


To demonstrate the feasibility of Telex, we implemented 
a proof-of-concept client and station. While we believe 
these prototypes are useful models for research and exper- 
imentation, we emphasize that they may not provide the 
performance or security of a more polished production 
implementation, and should be used accordingly. 


7.1 Client 


Our prototype client program, which we refer to as 
telex_client, is designed to allow any program that 
uses TCP sockets to connect to the Telex service without 
modification. It is written in approximately 1200 lines of 
C (including 500 lines of shared TLS utility code) and 
uses libevent to manage multiple connections. The user 
initializes telex_client by specifying a local port and 
a remote TLS server that is not blocked by the censor (e.g. 
NotBlocked.com). Once telex_client launches, it be- 
gins by listening on the specified local TCP socket. Each 
time a program connects to this socket, telex_client 
initiates a TLS connection to the unblocked server spec- 
ified previously. Following the Telex-TLS handshake 
protocol (see Section 5.2), telex_client inserts a tag, 
generated using the scheme described in Section 4, into 
the ClientHello nonce. We modified OpenSSL to accept 
supplied values for the nonce as well as the client’s Diffie- 
Hellman exponent. We supply this 1024-bit value as the 
output of a secure pseudorandom generator with input 
k_sh associated with the previously generated tag. These
changes required us to modify fewer than 20 lines of code 
in OpenSSL 1.0.0. 


7.2 Station 


Our prototype Telex station uses a modular design to pro- 
vide a basis for scaling the system to high-speed links and 
to ensure reliability. In particular, it fails safely: simple 
failures of the components will not impact non-Telex TLS 
traffic. The implementation is divided into three compo- 
nents, which are responsible for diversion, recognition, 
and proxying of network flows. 


Diversion The first component consists of a router at 
the ISP hosting the Telex station. It is configured to allow 
the Telex station to passively monitor TLS packets (e.g., 
TCP port 443) via a tap interface. Normally, the router 
will also forward the packets towards their destination, 
but the recognition and relay components can selectively 
command it to not forward traffic for particular flows. 
This allows the other components to selectively manipu- 
late packets and then reinject them into the network. In 
our implementation, the router is a Linux system that uses
the iptables and ipset [19] utilities for flow blocking. 


Recognition During the TLS handshake, the Telex 
station recognizes tagged connections by inspecting the 
ClientHello nonces. In our implementation, the recog- 
nition subsystem reconstructs the TCP connection using 
the Bro Network Intrusion Detection System [23]. Bro 
reconstructs the application-layer stream and provides 
an event-based framework for processing packets. We 
used the Bro scripting language for packet processing 
(approximately 300 lines), and we added new Bro built-in 
functions using C++ (approximately 450 lines). 

When the Bro script recognizes a TLS ClientHello 
message, it checks the client nonce to see whether it is 
tagged. (The tag checking logic is a C implementation 
of the algorithm described in Section 4.) If the nonce 
is tagged, we extract the shared secret associated with 
the tag and create an entry for the connection in a table 
indexed by flow. All future event handlers test whether 
the flow triggering the event is contained in this table, and 
do nothing if it is not. 
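
The flow-table logic can be paraphrased in Python (the prototype itself uses Bro scripts and C/C++). The `Flow` tuple, the `Recognizer` class, and the injectable `check_tag` callable are hypothetical stand-ins; `check_tag` is assumed to return the shared secret for a tagged nonce and `None` otherwise.

```python
from typing import Callable, Dict, NamedTuple, Optional


class Flow(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


class Recognizer:
    """Track tagged flows; ignore events for every other flow."""

    def __init__(self, check_tag: Callable[[bytes], Optional[bytes]]):
        self.check_tag = check_tag
        self.tagged: Dict[Flow, bytes] = {}   # flow -> shared secret

    def on_client_hello(self, flow: Flow, nonce: bytes) -> bool:
        k_sh = self.check_tag(nonce)
        if k_sh is None:
            return False                      # untagged: leave the flow alone
        self.tagged[flow] = k_sh
        return True

    def on_event(self, flow: Flow, handler):
        k_sh = self.tagged.get(flow)
        if k_sh is None:
            return None                       # not in the table: do nothing
        return handler(flow, k_sh)


# Stub tag check for demonstration only.
fake_check = lambda nonce: b"secret" if nonce.startswith(b"TAG") else None
rec = Recognizer(fake_check)
tagged_flow = Flow("198.51.100.7", 50000, "192.0.2.10", 443)
plain_flow = Flow("198.51.100.8", 50001, "192.0.2.10", 443)
rec.on_client_hello(tagged_flow, b"TAG-nonce")
rec.on_client_hello(plain_flow, b"random-nonce")
```

Keeping the default path a no-op is what lets the station fail safely for ordinary TLS traffic.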

The Bro script then instructs the diversion component 
(via a persistent TCP connection) to block the associated 
flow. As this does not affect the tap, our script still re- 
ceives the associated packets, and the script is responsible 
for actively forwarding them until the TLS Finished mes- 
sages are observed. This allows the Bro script to inspect 
each packet before forwarding it, while ensuring that any 
delays in processing will not cause a packet that should 
be blocked to make it through the router (e.g., a TLS Ap- 
plicationData packet from NotBlocked.com to the client). 
To derive the TLS shared secret from the key exchange, 
our Bro script also stores the necessary parameters from 
the TLS ServerKeyExchange message in the connection 
table. 

Once it observes the server’s TLS Finished handshake 
message, our Bro script stops forwarding packets between 
the client and the server (thus atomically severing traf- 
fic flow between them) and sends the connection state, 
which includes the TCP-level state (sequence number, 
TCP options, windows, etc.), the key exchange parameters,
and the shared secret k_sh, to the proxy service component.
Our proof-of-concept implementation handles only
the TCP timestamp, selective acknowledgements (SACK), 
and window scaling options, but other options could be 
handled similarly. Likewise, we currently only support 
TLS’s Diffie-Hellman key exchange, but RSA and other 
key exchange methods could also be supported. 


Proxy service The proxy service component plays the 
role of the TLS server and connects the client to blocked 
websites. Our implementation consists of a user space 
process called telex_relay and an associated kernel 
module, which are responsible for decapsulating TLS 
connection data and passing it to a local Squid proxy [25]. 
The telex_relay process is responsible for relaying
data from the client to the Squid proxy, in effect spoofing 
the server side of the connection. We defer forwarding 
of the last TLS Finished message until telex_relay has 
initialized its connection state in order to ensure that all 
application data is observed. We implement this delay by 
including the packet containing TLS Finished message in 
the state sent from our Bro script and leaving the task of 
forwarding the packet to its destination to telex_relay, 
thus avoiding further synchronization between the com- 
ponents. 

Similarly to telex_client, telex_relay is written 
in about 1250 lines of C (again including shared TLS 
utility code) and uses libevent to manage multiple connec- 
tions. It reuses our modifications to OpenSSL in order to 
substitute our shared secret for OpenSSL’s shared secret. 
We implement relaying of packets between the client and 
the Telex service straightforwardly, by registering event 
handlers to read from one party and write to the other 
using the usual send and recv system calls on the one 
hand and SSL_read and SSL_write on the other. 

To avoid easy detection, the relay’s TCP implementa- 
tion must appear similar to that of the original TLS server. 
Ideally, telex_relay would simply bind(2) to the address
of the original server and set the IP_TRANSPARENT
socket option, which, in conjunction with appropriate 
firewall and routing rules for transparent proxying [29], 
would cause its socket to function normally despite be- 
ing bound to a non-local address. This would cause the 
relay’s TCP implementation to be identical to that of the 
operating system that hosts it. However, the TCP hand- 
shake has already happened by the time our Bro script 
redirects the connection to telex_relay, so we need a 
method of communicating the state negotiated during the 
handshake to the TCP implementation. Accordingly, we 
modified the Linux 2.6.37 kernel to add a fake_accept 
ioctl that allows a userspace application to create a seem- 
ingly connected socket with arbitrary TCP state, including 
endpoint addresses, ports, sequence numbers, timestamps, 
and windows. 


8 Evaluation 


In this section, we evaluate the feasibility of our Telex 
proxy prototype based on measurements of its perfor- 
mance. 


8.1 Model deployment 


We used a small model deployment consisting of three 
machines connected in a hub-and-spoke topology. Our 
simulated router is the hub of our deployment, and the
two machines connected to it are the Telex station and a web
server serving pages over HTTPS and HTTP. The Telex
station has a 2.93 GHz Intel Core 2 Duo E7500 processor
and 2 GB of RAM. The server has a 4-core, 2.26 GHz
Intel Xeon E5520 processor and 11 GB of RAM. The
router has a 3.40 GHz Intel Pentium D processor and 1 GB
of RAM. All of the machines in our deployment and tests
are running Ubuntu Server 10.10 and are interconnected
using Gigabit Ethernet.

Figure 4: Client Request Throughput. We measured
the rate at which two client machines could complete
HTTP requests for a 1 kB page over a laboratory network,
using either TLS or our Telex prototype. The prototype’s
performance was competitive with that of unmodified
TLS.


8.2 Tagging performance 


We evaluated our tagging implementation by generating 
and verifying tags in bulk using a single CPU core on 
the Telex station. We performed ten trials, each of which 
processed a batch of 100,000 tags. The mean time to gen- 
erate a batch was 18.24 seconds with a standard deviation 
of 0.016 seconds, and the mean time to verify a batch was 
9.03 seconds with a standard deviation of 0.0083 seconds. 
This corresponds to a throughput of approximately 5482 
tags generated per second and 11074 tags verified per 
second. As our TLS throughput experiments show, tag 
verification appears very unlikely to be a bottleneck in 
our system. 
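
The throughput figures follow directly from the batch size and the mean batch times reported above; the short calculation below reproduces them (the helper name is hypothetical).

```python
def tags_per_second(batch_size: int, mean_seconds: float) -> float:
    """Throughput implied by processing batch_size tags in mean_seconds."""
    return batch_size / mean_seconds


BATCH = 100_000
generate_rate = tags_per_second(BATCH, 18.24)   # mean time to generate a batch
verify_rate = tags_per_second(BATCH, 9.03)      # mean time to verify a batch
print(int(generate_rate), int(verify_rate))
```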


8.3 Telex-TLS performance


To compare the overhead of Telex, we used our model 
deployment with two additional clients connected to the 
router. Our primary client machine (client A) has a 2.93 
GHz Intel Core 2 Duo E7500 processor and 2 GB of 
RAM. The secondary client machine (client B) has a 3.40 
GHz Intel Pentium D processor and 2 GB of RAM. For 
our control, we used the Apache benchmark ab [1] to
have each of the clients simultaneously download a static 
1-kilobyte page over HTTPS. To compare to Telex, we 
then configured ab to download the same page through 
the telex_client. Because the Telex tunnel itself is 
encrypted with TLS, we configured ab to use HTTP, 
not HTTPS, in this latter case. For the NotBlocked.com 
used by telex_client, we used our server on port 443 
(HTTPS) and for Blocked.com, we used our same server 
on port 80 (HTTP). 

We modified ab to ensure that only successful connec- 
tions were counted in throughput numbers and to override 
its use of OpenSSL’s SSL_OP_ALL option. This option 
originally caused ab to send fewer packets than a default 
configuration of OpenSSL, allowing the TLS control to 
perform artificially better at the cost of decreased security. 

We used ab to perform batches of 1000 connections 
(ab -n 1000); in each batch, we configured it to use a 
variable number of concurrent connections. We repeated 
each trial on our two clients (client A and client B) to get 
a mean connection throughput for each client. 

The results are shown in Figure 4; the performance 
of the Telex tunnel lags behind that of TLS at low con- 
currency, but catches up at higher concurrencies. The 
observed performance is consistent with Telex introducing
higher latency but similar throughput, which we posit
is due to Telex’s additional processing and network delay 
(e.g., execution of the fake_accept ioctl). Both Telex 
and TLS exhibit diminishing returns from more than 10 
concurrent requests, and both start to plateau at 30 con- 
current requests. Manual inspection of client machines’ 
CPU utilization confirms that the tests are CPU bound by 
50 concurrent connections. 


8.4 Real-world experience 


To test functionality on a real censor’s network, we ran a 
Telex client on a PlanetLab [24] node located in Beijing 
and attempted connections to each of the Alexa top 100 
websites [2] using our model Telex station located at the 
University of Michigan. As a control, we first loaded these 
sites without using Telex and noted apparent censorship 
behavior for 17 of them, including 4 from the top 10:
facebook.com, youtube.com, blogspot.com and twitter.com.
The blocking techniques we observed included forged 
RST packets, false DNS results, and destination IP black 
holes, which are consistent with previous findings [15]. 
We successfully loaded all 100 sites using Telex. We also 
compared the time taken to load the 83 unblocked sites 
with and without Telex. While this metric was difficult 
to measure accurately due to varying network conditions, 
we observed a median overhead of approximately 60%. 
To approximate the user experience of a client in China, 
we configured a web browser on a machine in Michigan 
to proxy its connections over an SSH tunnel to our Telex
client running in Beijing. Though each request traveled 
from Ann Arbor to China and back before being for- 
warded to its destination website (a detour of at least 
32,000 km), we were able to browse the Internet uncen- 
sored, and even to watch streaming YouTube videos. 

Anecdotally, three of the authors have used Telex for 
their daily Web browsing for about two months, from 
various locations in the United States, with acceptable 
stability and little noticeable performance degradation. 
The system received additional stress testing because an 
early version of the Telex client did not restrict incom- 
ing connections to the local host, and, as a result, one 
of the authors’ computers was enlisted by others as an 
open proxy. Given the amount of malicious activity we 
observed before the issue was corrected, our prototype 
deployment appears to be robust enough to handle small- 
scale everyday use. 


9 Future Work 


Maturing Telex from our current proof-of-concept to a 
large-scale production deployment will require substantial 
work. In this section, we identify four areas for future 
improvement. 


Traffic shaping An advanced censor may be able to 
distinguish Telex activity from normal TLS connections 
by analyzing traffic characteristics such as the packet and 
document sizes and packet timing. We conjecture that this 
would be difficult to do on a large scale due to the large 
variety of sites that can serve as NotBlocked and the dis- 
ruptive impact of false positives. Nevertheless, in future 
work we plan to adapt techniques from prior work [18] 
to defend Telex against such analysis. In particular, we 
anticipate using a dynamic padding scheme to mimic the 
traffic characteristics of NotBlocked.com. Briefly, for 
every client request meant for Blocked.com, the Telex 
station would generate a real request to NotBlocked.com 
and use the reply from NotBlocked.com to restrict the 
timing and length of the reply from Blocked.com (as- 
suming the Blocked.com reply arrived earlier). If the 
NotBlocked.com data arrived first, the station would send 
padding as a reply to the client, including a command to 
send a second “request” if necessary to ensure that the 
apparent document length, packet size, and round trip 
time remained consistent with that of NotBlocked.com. 


Server mimicry Different service implementations 
and TCP stacks are easily distinguished by their observ- 
able behavior [21, Chapter 8]. This presents a substantial 
challenge for Telex: to avoid detection when the Not- 
Blocked.com server and the Telex station run different 
software, a production implementation of Telex would 
need to accurately mimic the characteristics of many common
server configurations. Our prototype implementation
does not attempt this, and we have noted a variety of ways 
that it deviates from TLS servers we have tested. These 
deviations include properties at the IP layer (e.g. stale IP 
ID fields), the TCP layer (e.g. incorrect congestion windows,
which are detectable by early acknowledgements),
and the TLS layer (e.g. different compression methods 
and extensions provided by our more recent OpenSSL 
version). While these specific examples may themselves 
be trivial to fix, convincingly mimicking a diverse popu- 
lation of sites will likely require substantial engineering 
effort. One approach would be for the Telex station to 
maintain a set of userspace implementations of popular 
TCP stacks and use the appropriate one to masquerade as 
NotBlocked.com. 


Station scalability Widescale Telex deployment will
likely require Telex stations to scale to thousands of con- 
current connections, which is beyond the capacity of our 
prototype. We plan to investigate techniques for adapt- 
ing station components to run on multiple distributed 
machines. Clustering techniques [31] developed for in- 
creasing the scalability of the Bro IDS may be applicable. 


Station placement Telex raises a number of questions 
related to Internet topography. How many ISPs would 
need to participate to provide global coverage? Short of 
this, where should stations be placed to optimally cover a 
particular censor’s network? We leave accurate deploy- 
ment modelling for future work. 

Furthermore, we currently make the optimistic assump- 
tion that all packets for the client’s connection to Not- 
Blocked.com pass through some particular Telex station, 
but this might not be the case if there are asymmetric 
routes or other complications. Does this assumption hold 
widely enough for Telex to be practically deployed? If 
not, the system could be enhanced in future work to sup- 
port cooperation among Telex stations on different paths, 
or to support multi-headed stations consisting of several 
routers in different locations diverting traffic to common 
recognition and relay components. 


10 Conclusion 


In this paper, we introduced Telex, a new concept in 
censorship resistance. By moving anticensorship service 
from the edge of the network into the core network infras- 
tructure, Telex has the potential to provide both greater 
resistance to blocking and higher performance than ex- 
isting approaches. We proposed a protocol for stegano- 
graphically implementing Telex on top of TLS, and we 
supported its feasibility with a proof-of-concept imple- 
mentation. Scaling up to a production implementation 
will require substantial engineering effort and close part- 
nerships with ISPs, and we acknowledge that worldwide 
deployment seems unlikely without government participation.
However, Internet access increasingly promises to
empower citizens of repressive governments like never be- 
fore, and we expect censorship-resistant communication 
to play a growing part in foreign policy. 
A Tagging Details


Our system uses an elliptic curve $E$ defined over a field of
prime order $p$. We choose $p$ to be 3 mod 4, so that $-1$ will
be a quadratic nonresidue mod $p$. ($z$ is a quadratic residue
mod $p$ if there exists an integer $y$ such that $y^2 = z \bmod p$.
Otherwise, $z$ is a quadratic nonresidue mod $p$. Half of
the non-zero elements mod $p$ are quadratic residues, and
half are nonresidues.) Let $\ell_p$ be the bit length of $p$, and
ensure that $2^{\ell_p} - p < \sqrt{p}$. The curve $E$ is defined by the
equation $y^2 = x^3 - 3x + b \bmod p$ for a particular value of
$b$.

For some values of $x \in \mathbb{F}_p$, $z = x^3 - 3x + b$ will be a
quadratic residue mod $p$; for those values, $y = z^{(p+1)/4}$ will
be a square root of $z$ and $(x, y)$ will be on the elliptic curve
$E$.

The other values of $x$ will never occur as the x-coordinate
of a point on the elliptic curve $E$; however, for
those values of $x$, $-z$ will be a quadratic residue, $y = (-z)^{(p+1)/4}$
will be a square root of $-z$, and $(x, y)$ will be a point on
the “twist” curve $E'$ defined by $-y^2 = x^3 - 3x + b$. We
choose a value of $b$ such that both $E$ and $E'$ have prime
order over $\mathbb{F}_p$. It is a fact about elliptic curves that the
orders $o$ and $o'$ of $E$ and $E'$ will satisfy $o = p + 1 - t$ and
$o' = p + 1 + t$, for some $|t| < 2\sqrt{p}$.
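
The curve-or-twist dichotomy is easy to check numerically. The sketch below uses a toy prime and an arbitrary coefficient chosen here purely for illustration (they are not the system's parameters, and the resulting curves are not claimed to have prime order); the function names are hypothetical.

```python
P = 1019   # toy prime with P % 4 == 3; the real system uses a much larger p
B = 7      # arbitrary toy coefficient; NOT chosen to give prime-order curves


def sqrt_mod(z: int, p: int):
    """Return a square root of z mod p (for p = 3 mod 4), or None if none exists."""
    z %= p
    y = pow(z, (p + 1) // 4, p)
    return y if (y * y) % p == z else None


def lift_x(x: int, b: int = B, p: int = P):
    """Classify x as an x-coordinate on E or on the twist E'; return (name, y)."""
    z = (x**3 - 3 * x + b) % p
    y = sqrt_mod(z, p)
    if y is not None:
        return "E", y                 # y^2 = x^3 - 3x + b
    # z is a nonresidue, so -z is a residue because -1 is a nonresidue.
    return "E'", sqrt_mod(-z, p)      # -y^2 = x^3 - 3x + b


on_curve = sum(1 for x in range(P) if lift_x(x)[0] == "E")
on_twist = P - on_curve
```

Every field element is therefore usable as an x-coordinate on exactly one of the two curves (apart from the points with $y = 0$), which is what allows an arbitrary-looking string to encode a valid point.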

Define a function $\phi : \{0,1\}^{\ell_p} \times \{0,1\}^{\ell_p} \to \{0,1\}^{\ell_p}$,
such that $\phi(r, x)$ is the point multiplication on the elliptic
curve ($E$ or $E'$) which contains a point $X$ with $x$-coordinate
$x$. To compute $\phi(r, x)$, consider $r$ and $x$ as
integers expressed as little-endian strings. $x$ will be the
$x$-coordinate of a point $X = (x, y)$ on one of the curves. On
that curve, compute $R = r \cdot X$, and output the $x$-coordinate
of $R$, expressed as a little-endian string. If $R$ is the point
at infinity (which happens if and only if $r$ is a multiple of
the curve order), $\phi(r, x)$ is undefined. We note that this
is the same function (albeit over different curves) as was
used by Bernstein in Curve25519 [3].
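A minimal sketch of this function, under stated assumptions: it works on Python integers rather than little-endian strings (`int.from_bytes(s, "little")` would convert), it uses textbook affine double-and-add rather than whatever arithmetic the authors' implementation uses, and it is not constant-time. The names `lift_x`, `add`, and `phi` are hypothetical. Both curves are handled as $B y^2 = x^3 - 3x + b$ with $B = 1$ for $E$ and $B = -1$ for $E'$.

```python
def lift_x(x, p, b, a=-3):
    """Return (B, y) with B*y^2 = x^3 + ax + b mod p; B = 1 for E, B = -1 for the twist."""
    z = (x * x * x + a * x + b) % p
    y = pow(z, (p + 1) // 4, p)            # square root of z if z is a residue (p = 3 mod 4)
    if (y * y) % p == z:
        return 1, y
    return -1, pow((-z) % p, (p + 1) // 4, p)  # otherwise -z is a residue


def add(P, Q, p, B, a=-3):
    """Add two points on B*y^2 = x^3 + ax + b; None represents the point at infinity."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None                         # P + (-P), including doubling a point with y = 0
    if P == Q:
        lam = (3 * x1 * x1 + a) * pow((2 * B * y1) % p, -1, p) % p
    else:
        lam = (y2 - y1) * pow((x2 - x1) % p, -1, p) % p
    x3 = (B * lam * lam - x1 - x2) % p
    y3 = (lam * (x1 - x3) - y1) % p
    return x3, y3


def phi(r, x, p, b):
    """x-coordinate of r*X, where X is a point with x-coordinate x; None if undefined."""
    B, y = lift_x(x % p, p, b)
    result, addend = None, (x % p, y)
    while r > 0:                            # double-and-add over the bits of r
        if r & 1:
            result = add(result, addend, p, B)
        addend = add(addend, addend, p, B)
        r >>= 1
    return None if result is None else result[0]
```

Because a point and its negation share an x-coordinate, the arbitrary sign chosen in `lift_x` does not affect the output, which is what makes `phi(s, phi(r, g))` and `phi(r, phi(s, g))` agree.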

The tagging protocol is as follows: 





Setup. Telex selects arbitrary generators of $E$ and $E'$
and publishes their $x$-coordinates as little-endian strings
$g_0$ and $g_1$. Since $E$ and $E'$ have prime order, any non-identity
element is a generator of those groups. Telex
selects a random private key $r \in \{0,1\}^{\ell_p}$, and publishes
$\alpha_0 = \phi(r, g_0)$ and $\alpha_1 = \phi(r, g_1)$. If either of those values
is undefined because $r$ is a multiple of either group
order (this happens with probability less than $2^{2-\ell_p}$), a
different value for $r$ can be selected. Telex also publishes
hash functions $H_1 : \{0,1\}^* \to \{0,1\}^{\ell_{H_1}}$ and
$H_2 : \{0,1\}^* \to \{0,1\}^{\ell_{H_2}}$.

Client tag generation. Given a context string $\chi$, the
client selects a random $s \in \{0,1\}^{\ell_p}$ and a random bit
$b \in \{0,1\}$. The client computes $\beta = \phi(s, g_b)$ and
$k = \phi(s, \alpha_b)$. (The bit $b$ selects whether the client will be using
$E$ or $E'$.) In the extremely unlikely event (probability
approximately $2^{1-\ell_p}$) that $s$ is a multiple of the group
order, $\phi(s, \alpha_b)$ will be undefined, and the client can select
a different $s$. The client publishes the tag $\beta \| H_1(k \| \chi)$ and
stores the shared secret key $H_2(k \| \chi)$ for later use. Again
viewing $\phi$ as point multiplication, we can see that the generation
of the value $k$ is just elliptic curve Diffie-Hellman;
we will exploit this fact in the security argument below.


Telex tag inspection. Given a context string $\chi$ and a
purported $(\ell_p + \ell_{H_1})$-bit tag, the Telex station parses the
tag as $\beta \| h$ where $\beta$ is $\ell_p$ bits and $h$ is $\ell_{H_1}$ bits. It computes
$k' = \phi(r, \beta)$ and $h' = H_1(k' \| \chi)$. If $h = h'$, the Telex station
accepts the tag as valid, and outputs $H_2(k' \| \chi)$ as the
shared secret key for later use. Otherwise, it rejects the
tag as invalid.
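To see why inspection of an honestly generated tag succeeds, it helps to write out the Diffie-Hellman step explicitly. In the sketch below, $X(P)$ is notation introduced here for the $x$-coordinate of a point $P$, and $G_b$ is a point with $x$-coordinate $g_b$, so that $\alpha_b = X(rG_b)$ and $\beta = X(sG_b)$; the $\pm$ signs reflect the fact that an $x$-coordinate determines a point only up to negation.

```latex
% X(P): x-coordinate of P (notation assumed here); X(P) = X(-P).
\begin{align*}
k' = \phi(r, \beta) &= X\bigl(r \cdot (\pm s\,G_b)\bigr) = X(\pm rs\,G_b) \\
                    &= X\bigl(s \cdot (\pm r\,G_b)\bigr) = \phi(s, \alpha_b) = k
\end{align*}
```

Hence $h' = H_1(k' \| \chi)$ equals the $h$ in the tag, and both sides derive the same key $H_2(k \| \chi)$.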


A.1 Parameter selection 


In our implementation, we use $p = 2^{168} - 2^8 - 1$ (and so
$\ell_p = 168$). Using sage version 4.5.2 [26], we searched
for an appropriate value of $b$ by randomly selecting candidate
values of $b$ until the orders of $E$ and $E'$ both
turned out to be prime. This search took only a few
minutes on an 8-core computer, and yielded the value
$b = 114301813541519167821195403070898020343878856329174$. The
curve $E$ has order $p + 1 - t$ and the twist $E'$ has
order $p + 1 + t$ (both of which are prime) for
$t = -25904187505858679946718103$. $g_0$ is the 168-bit
little-endian representation of the number 2, and $g_1$ is
likewise of the number 0. The hash functions $H_1$ and
$H_2$ are both based on the SHA256 hash function; we select
$\ell_{H_1} = 56$ and $\ell_{H_2} = 128$, and set $H_1$ to be the first
56 bits of the SHA256 output, and $H_2$ to be the last 128
bits of the SHA256 output. The resulting tag length is
$\ell_p + \ell_{H_1} = 224$ bits, which is the size of the random portion
of a TLS ClientHello message.
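The tag layout implied by these parameters can be sketched as follows. This is an illustration under assumptions, not the authors' code: the byte-level encoding of $k \| \chi$ is assumed to be plain concatenation of a 21-byte value and the context bytes, the elliptic-curve step is abstracted behind a hypothetical `derive_k` callback (standing in for $\phi(r, \beta)$), and the function names are invented here.

```python
import hashlib
import hmac

L_P_BYTES = 21    # 168-bit beta and k
L_H1_BYTES = 7    # 56-bit check value
L_H2_BYTES = 16   # 128-bit shared key


def h1(data: bytes) -> bytes:
    """First 56 bits of SHA-256."""
    return hashlib.sha256(data).digest()[:L_H1_BYTES]


def h2(data: bytes) -> bytes:
    """Last 128 bits of SHA-256."""
    return hashlib.sha256(data).digest()[-L_H2_BYTES:]


def make_tag(beta: bytes, k: bytes, context: bytes):
    """Client side: return (224-bit tag, shared key) from beta, k and the context string."""
    if len(beta) != L_P_BYTES or len(k) != L_P_BYTES:
        raise ValueError("beta and k must be 168 bits each")
    return beta + h1(k + context), h2(k + context)


def inspect_tag(tag: bytes, context: bytes, derive_k):
    """Station side: return the shared key if the tag verifies, else None.
    derive_k(beta) is a hypothetical stand-in for phi(r, beta)."""
    if len(tag) != L_P_BYTES + L_H1_BYTES:
        return None
    beta, h = tag[:L_P_BYTES], tag[L_P_BYTES:]
    k_prime = derive_k(beta)
    if not hmac.compare_digest(h1(k_prime + context), h):
        return None
    return h2(k_prime + context)
```

The 28-byte result matches the 224-bit random portion of the ClientHello mentioned above.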

Choosing $\ell_p = 168$ requires an adversary (under the
usual security assumptions for elliptic curves) to perform
$2^{84}$ computations in order to break the tagging scheme by
recovering the private key from the public key (and thus
violating the DDH assumption below). While we believe
this is sufficient, there are a number of methods we can use
to guard against even more powerful adversaries. The first
is that the key strength ($2^{\ell_p/2}$) can be traded off against
the rate of false positives ($2^{-\ell_{H_1}}$) under the restriction that
$\ell_p + \ell_{H_1} = 224$. There are also other places [17] where one can
hide random-looking bits in a TLS session, which would extend
the space beyond the 224 bits we use to hide our tag. Next, we can
limit the utility of expending massive effort to recover
the Telex private key by having multiple keys that may
correspond to time, source, and/or destination. These
public keys could be bundled with the Telex client code.
Depending on the duration each public key is used, time-based
keys would have to be refetched periodically. As an
example, a system that switches public keys every hour
could bundle 1 million keys, enough to last for over 114
years, in only 42 MB of space.
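The figures in this paragraph can be reproduced with a few lines of arithmetic. The sketch below assumes that one public key consists of the two 168-bit values $\alpha_0$ and $\alpha_1$ (42 bytes), which is an inference from the 42 MB figure rather than something the text states; the function names are hypothetical.

```python
def tag_budget(ell_p, tag_bits=224):
    """Split the tag between curve size and check-value size."""
    ell_h1 = tag_bits - ell_p
    return {
        "key_strength_bits": ell_p // 2,   # about 2^(ell_p/2) work to recover the key
        "false_positive_log2": -ell_h1,    # false-positive rate of 2^(-ell_h1)
    }


def key_bundle(num_keys, ell_p=168, hours_per_key=1):
    """Size in MB and lifetime in years of a bundle of time-based public keys."""
    bytes_per_key = 2 * ell_p // 8         # assumed: alpha_0 and alpha_1, 21 bytes each
    size_mb = num_keys * bytes_per_key / 1e6
    years = num_keys * hours_per_key / (24 * 365.25)
    return size_mb, years


if __name__ == "__main__":
    print(tag_budget(168))
    print(key_bundle(1_000_000))
```

Raising `ell_p` in `tag_budget` shows the trade-off directly: each extra bit of curve size removes one bit from the check value.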


A.2 Security argument 


We must argue that an adversary, given $g_0$, $g_1$, $\alpha_0$, $\alpha_1$,
and a candidate tag $\tau$, cannot determine whether $\tau$ was
an output from the above client tag generation algorithm
or was just a $(\ell_p + \ell_{H_1})$-bit string generated uniformly at
random. We
claim that the distribution of $\beta$ values is only negligibly
different from a uniform distribution of $\ell_p$-bit values, and
also that, under reasonable cryptographic assumptions,
given $\beta$, an adversary cannot distinguish the correct value
of $h$ that would appear in a valid tag from a random $\ell_{H_1}$-bit
value.

To see the former, consider the distribution of possible
values of $\beta = \phi(s, g_0)$ as $s$ ranges over $\{0,1\}^{\ell_p}$. Treating
$s$ as a number, this distribution is only negligibly different
from that resulting from the range $1 \le s < o$, where
$o$ is the order of $E$. The latter is the distribution of
$x$-coordinates of a uniformly selected (non-infinity) point of
$E$. Let $L_0$ be the set of values $x \in \mathbb{F}_p$ such that $x^3 - 3x + b$
is a quadratic residue. Then every value in $L_0$ appears as
the $x$-coordinate of two points of $E$, except possibly for
up to 3 points whose $y$-coordinates are 0, which appear
only once each. The previous distribution is then only
negligibly different from the uniform distribution on $L_0$.
If $L_1$ is the set of values $x \in \mathbb{F}_p$ such that $x^3 - 3x + b$ is
a quadratic nonresidue, then the same argument shows
that the distribution of possible values of $\beta = \phi(s, g_1)$
is only negligibly different from a uniform distribution
on $L_1$. The required distribution of $\beta$ is then negligibly
different from the result of selecting a uniform element of
$L_b$ where $b$ is a uniform random bit. Since the sizes of $L_0$
and $L_1$ are negligibly different, and $L_0$ and $L_1$ are disjoint,
and the size of $L_0 \cup L_1$ is $p$, which is negligibly different
from $2^{\ell_p}$ (as we chose $p$ to be only slightly smaller than a
power of 2), our result follows.

To see the latter, we require the Decision (Co-)Diffie-Hellman
(DDH and DCoDH) assumptions [4, 5]: that
no adversary, given the points $P$ and $rP$, can distinguish
the distributions $\{(Q, rQ)\}$ and $\{(Q, r'Q)\}$ with
non-negligible advantage, where $P$ and $Q$ are points on either
$E$ or $E'$ and $r$ and $r'$ are selected uniformly at random
from their respective domains (or, as above, from $(0, 2^{\ell_p})$).
If $P$ and $Q$ are on the same curve, this is DDH; if one is
on $E$ and one on $E'$, this is DCoDH. We also need an
assumption on the properties of $H_1$; namely, that for any
$\chi$ and any bit $b$, the distribution $\{H_1(\phi(s, \alpha_b) \| \chi)\}$ over
all $s$ is indistinguishable from the uniform distribution on
$\ell_{H_1}$-bit strings. (This is of course true if $H_1$ is modelled
as a random oracle, but seems likely to be true for our
SHA256-based $H_1$ as well.)

An adversary that can distinguish
$\{(\phi(s, g_b), H_1(\phi(s, \alpha_b) \| \chi))\}$ from $\{(\phi(s, g_b), \$)\}$
(where $\$$ are uniform $\ell_{H_1}$-bit values) can also
distinguish $\{(\phi(s, g_b), H_1(\phi(s, \alpha_b) \| \chi))\}$ from
$\{(\phi(s, g_b), H_1(\phi(s', \alpha_b) \| \chi))\}$ by our assumption
on $H_1$. He can then distinguish $\{(\phi(s, g_b), \phi(s, \alpha_b))\}$
from $\{(\phi(s, g_b), \phi(s', \alpha_b))\}$ by taking hashes, and
$\{(sG_b, sA_b)\}$ from $\{(sG_b, s'A_b)\}$ by taking $x$-coordinates,
where $G_b$ is the elliptic curve point with $x$-coordinate
$g_b$ and $A_b$ is the elliptic curve point with $x$-coordinate
$\alpha_b$. Writing $Q = sG_b$ and $r' = r s' s^{-1}$, and noting
that $A_b = rG_b$, this is the same as distinguishing the
distributions $\{(Q, rQ)\}$ and $\{(Q, r'Q)\}$, given $G_b$ and
$A_b = rG_b$, which is impossible by the DDH assumption.
Care must also be taken to ensure that the adversary’s
knowledge of $(G_{1-b}, A_{1-b})$ does not aid him, but this can
also be seen to be true by DCoDH.

In summary, under the DDH and DCoDH assumptions
on $E$ and $E'$ and a random-looking-output assumption
on $H_1$, an adversary who does not know Telex’s private
key $r$ cannot distinguish valid tags from uniformly generated
$(\ell_p + \ell_{H_1})$-bit strings with more than a negligible
advantage.


Abstract


Existing anonymous communication systems like Tor do 
not scale well as they require all users to maintain up-to- 
date information about all available Tor relays in the sys- 
tem. Current proposals for scaling anonymous commu- 
nication advocate a peer-to-peer (P2P) approach. While 
the P2P paradigm scales to millions of nodes, it pro- 
vides new opportunities to compromise anonymity. In 
this paper, we step away from the P2P paradigm and ad- 
vocate a client-server approach to scalable anonymity. 
We propose PIR-Tor, an architecture for the Tor net- 
work in which users obtain information about only a few 
onion routers using private information retrieval tech- 
niques. Obtaining information about only a few onion 
routers is the key to the scalability of our approach, while 
the use of private information retrieval techniques helps
preserve client anonymity. The security of our architec- 
ture depends on the security of PIR schemes which are 
well understood and relatively easy to analyze, as op- 
posed to peer-to-peer designs that require analyzing ex- 
tremely complex and dynamic systems. In particular, we 
demonstrate that reasonable parameters of our architec- 
ture provide equivalent security to that of the Tor net- 
work. Moreover, our experimental results show that the 
overhead of PIR-Tor is manageable even when the Tor 
network scales by two orders of magnitude. 


1 Introduction 


As more of our daily activities shift online, the issue of 
user privacy comes to the forefront. Anonymous com- 
munication is a privacy enhancing technology that en- 
ables a user to communicate with a recipient without re- 
vealing her identity (IP address) to the recipient or a third 
party (for example, Internet routers). Tor [10] is a deployed
network for anonymous communication, which
consists of about 2000 relays and currently serves hundreds
of thousands of users a day [45]. Tor is widely used
by whistleblowers, journalists, businesses, law enforcement
and government organizations, and regular citizens
concerned about their privacy [46].

*An extended version of this paper is available [26].

Tor requires each user to maintain up-to-date infor- 
mation about all available relays in the network (global 
view). As the number of relays and clients increases, 
the cost of maintaining this global view becomes pro- 
hibitively expensive. In fact, McLachlan et al. [22] 
showed that in the near future the Tor network could be 
spending more bandwidth for maintaining a global view 
of the system than for anonymous communication itself. 
Existing approaches to improving Tor’s scalability ad- 
vocate a peer-to-peer approach. While the peer-to-peer 
paradigm scales to millions of relays, it also provides 
new opportunities for attack. The complexity of the de- 
signs makes it difficult for the authors to provide rigorous 
proofs of security. The result is that the security community
has been very successful at breaking the state-of-the-art
peer-to-peer anonymity designs [4, 6, 7, 23, 47, 48].

In this paper, we step away from the peer-to-peer 
paradigm and propose PIR-Tor, a scalable client-server 
approach to anonymous communication. The key obser- 
vation motivating our architecture is that clients require 
information about only a few relays (3 in the current 
Tor network) to build a circuit for anonymous commu- 
nication. Currently, clients download the entire database 
of relays to protect their anonymity from compromised 
directory servers. In our proposal, on the other hand, 
clients use private information retrieval (PIR) techniques 
to download information about only a few relays. PIR 
prevents untrusted directory servers from learning any 
information about the clients’ choices of relays, and thus 
mitigates route fingerprinting attacks [6, 7].

We consider two architectures for PIR-Tor, based on 
the use of computational PIR and information-theoretic 
PIR, and evaluate their performance and security. We 
find that for the creation of a single circuit, the architecture
based on computational PIR provides an order
of magnitude improvement over a full download of all 
descriptors, while the information-theoretic architecture 
provides two orders of magnitude improvement over a 
full download. However, in the scenario where clients 
wish to build multiple circuits, several PIR queries must 
be performed and the communication overhead of the 
computational PIR architecture quickly approaches that 
of a full download. In this case, we propose to perform 
only a few PIR queries and reuse their results for cre- 
ating multiple circuits, and discuss the security implica- 
tions of the same. On the other hand, for the information- 
theoretic architecture, we find that even with multiple cir- 
cuits, the communication overhead is at least an order of 
magnitude smaller than a full download. It is therefore 
feasible for clients to perform a PIR query for each de- 
sired circuit. In particular, we show that, subject to cer- 
tain constraints, this results in security equivalent to the 
current Tor network. With our improvements, the Tor 
network can easily sustain a 10-fold increase in both re- 
lays and clients. PIR-Tor also enables a scenario where 
all clients convert to middle-only relays, improving the 
security and the performance of the Tor network [9]. 

The remainder of this paper is organized as follows. 
We discuss related work in Section 2. We present a brief 
overview of Tor and private information retrieval in Sec- 
tion 3. In Section 4, we give an overview of our system 
architecture, and present the full protocol in Section 5. 
We discuss the traffic analysis implications of our archi- 
tecture in Section 6. Sections 7 and 8 contain our perfor- 
mance evaluation for the computational and information- 
theoretic PIR proposals respectively. We discuss the 
ramifications of our design in Section 9, and finally con- 
clude in Section 10. 


2 Related Work 


In contrast to our client-server approach, prior work 
mostly advocates a peer-to-peer approach for scalable 
anonymous communication. We can categorize existing 
work on peer-to-peer anonymity into architectures that 
are based on random walks on unstructured or structured 
topologies, and architectures that use a lookup operation 
in a distributed hash table. 

Besides these peer-to-peer approaches, Mittal et
al. [25] briefly considered the idea of using PIR queries 
to scale anonymous communication. However, their de- 
scription was not complete, and their evaluation was very 
preliminary. In this paper, we build upon their work and 
present a complete system architecture based on PIR. 
In contrast to prior work, we also consider the use of 
information-theoretic PIR, and show that it outperforms 
the computational-PIR-based Tor architecture in many scaling
scenarios. We also provide an analysis of the implications
of clients not having the global system view,
and show that reasonable parameters of PIR-Tor provide 
equivalent security to Tor. 


2.1 Distributed hash table based architectures


Distributed hash tables (DHTs), also known as struc- 
tured peer-to-peer topologies, assign neighbor relation- 
ships using a pseudorandom but deterministic mathemat- 
ical formula based on IP addresses or public keys of 
nodes. 

Salsa [29] is built on top of a DHT, and uses a spe- 
cially designed secure lookup operation to select random 
relays in the network. The secure lookups use redundant 
checks to mitigate attacks that try to bias the result of the 
lookup. However, Mittal and Borisov [23] showed that 
Salsa is vulnerable to information leak attacks: as the at- 
tackers can observe a large fraction of the lookups in the 
system, a node’s selection of relays is no longer anony- 
mous and this observation can be used to compromise 
user anonymity [6, 7]. Salsa is also vulnerable to a selective
denial-of-service attack, where nodes break circuits
that they cannot compromise [4, 47].

Panchenko et al. proposed NISAN [35] in which 
information-leak attacks are mitigated by a secure iter- 
ative lookup operation with built-in anonymity. The se- 
cure lookup operation uses redundancy to mitigate active 
attacks, but hides the identity of the lookup destination 
from the intermediate nodes by downloading the entire 
routing table of the intermediate nodes and processing 
the lookup operation locally. However, Wang et al. [48] 
were able to drastically reduce the lookup anonymity by 
taking into account the structure of the topology and the 
deterministic nature of the paths traversed by the lookup 
mechanism. 

Torsk, introduced by McLachlan et al. [22], uses secret 
buddy nodes to mitigate information leak attacks. Instead 
of performing a lookup operation themselves, nodes can 
instruct their secret buddy nodes to perform the lookup 
on their behalf. Thus, even if the lookup process is not 
anonymous, the adversary will not be able to link the 
node with the lookup destination (since the relationship 
between a node and its buddy is a secret). However, the 
aforementioned work of Wang et al. [48] also showed 
some vulnerabilities in the mechanism for obtaining se- 
cret buddy nodes. 


2.2 Random walk based architectures


In MorphMix [38] the scalability problem in Tor is al- 
leviated by organizing relays in an unstructured peer-to- 
peer overlay, where each relay has knowledge of only a 
few other relays in the system. For building circuits, an
initiator performs a random walk by first selecting a random
neighbor and building an onion routing circuit to
it. The initiator can then query the neighbor for its list 
of neighbors, select a random peer, and then extend the 
onion routing circuit to it. This process can be iterated a 
number of times to build a random walk of any desired 
length. 

MorphMix is vulnerable to a route capture attack, 
where a malicious relay returns a list of only other col- 
luding nodes during a random walk. This attack ensures 
that once the random walk hits a compromised relay, all 
subsequent relays in the random walk are also compro- 
mised. In particular, when the first relay in the random 
walk is compromised, user anonymity is trivially broken. 
While MorphMix proposed a collusion detection mech- 
anism to mitigate the route capture attack, it was later 
shown that the mechanism can be broken by a collud- 
ing set of attackers that models the internal state of each 
relay [44]. 
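The route capture attack can be illustrated with a small simulation. This is a toy model under stated assumptions, not MorphMix itself: the overlay is a hypothetical random graph, the initiator's first hop is drawn uniformly, no onion encryption or collusion detection is modelled, and all names are invented here.

```python
import random


def build_overlay(n, degree, rng):
    """Hypothetical unstructured overlay: each relay knows `degree` random other relays."""
    return {v: rng.sample([u for u in range(n) if u != v], degree) for v in range(n)}


def random_walk(overlay, malicious, length, rng):
    """Build a path by repeatedly asking the current relay for its neighbor list."""
    colluders = sorted(malicious)
    current = rng.choice(sorted(overlay))   # simplified choice of the first hop
    path = [current]
    for _ in range(length - 1):
        # An honest relay reports its real neighbors; a malicious relay
        # reports only colluding relays, capturing the rest of the route.
        reported = colluders if current in malicious else overlay[current]
        current = rng.choice(reported)
        path.append(current)
    return path


if __name__ == "__main__":
    rng = random.Random(0)
    malicious = set(range(20))               # 20% of 100 relays collude
    overlay = build_overlay(100, 5, rng)
    walks = [random_walk(overlay, malicious, 6, rng) for _ in range(10_000)]
    captured = sum(w[0] in malicious and w[-1] in malicious for w in walks)
    print("fraction with compromised first and last relay:", captured / len(walks))
```

In this model every relay after the first compromised one is also compromised, so a walk whose first relay is malicious is captured end to end.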

ShadowWalker [24] also uses a random walk to locate
relays, but instead of organizing relays into an unstruc- 
tured overlay, it uses a distributed hash table. Neighbor 
relationships in the DHT are deterministic, and can be 
verified by the initiator to mitigate route capture attacks. 
To prevent any information leakage during verification of 
neighbor information, some redundancy is incorporated 
into the topology itself. Recently, Schuchard et al. [39] 
analyzed an attack on ShadowWalker, and also studied a 
fix for the attack. 

We note that all of the peer-to-peer designs provide 
only heuristic security, and the security community has 
been very successful at breaking the state-of-the-art designs.
This is partly because of the complexity of the designs,
which makes it difficult for the system designers to rigorously
analyze the security of the system. We also note
that all secure peer-to-peer systems are built on top of 
assumptions that are difficult to realize in practice. For 
example, security of these designs depends on the frac- 
tion of compromised relays in the system being less than 
20-25%. Modern botnets can comprise tens to hundreds
of thousands of bots [19], which is likely sufficient
to overwhelm the security of the system. In PIR-Tor, we 
target a design where it is feasible to rigorously argue 
about the anonymity properties of the design, and where 
the ability to obtain random relays both securely and 
anonymously does not depend on the fraction of com- 
promised relays in the system. 


3 Background 


3.1 Tor 


Tor [10] is a deployed network for low-latency anonymous
communication. Tor serves hundreds of thousands
of clients, and carries terabytes of traffic per day [45].
The network comprises approximately 2000 relays
as of February 2011 [20]. Tor clients first download a 
complete list of relays (called the network consensus) 
from directory servers, and then further download de- 
tailed information about each of the relays (called the 
relay descriptors). The network consensus is signed by 
trusted directory authorities to prevent directory servers 
from manipulating its contents. Clients select three re- 
lays to build circuits for anonymous communication. A 
fresh network consensus must be downloaded at least as 
often as every 3 hours, while fresh relay descriptors are 
downloaded every 18 hours. 

To protect against certain long-term attacks [33] on 
anonymous communication, each client, when it starts 
Tor for the first time, selects a set of three guard re- 
lays from among fast and stable nodes. As long as the 
selected guards remain available, new ones will not be 
chosen. The first relay in any circuit constructed by the 
client will be one of its three guards. Also, clients select 
the final relay from the subset of the Tor relays which 
allow traffic to exit to the Internet, called the exit relays; 
each exit relay has an exit policy, which lists the ports 
to which the relay is willing to forward traffic, and the 
client’s choice of exit relay must of course be compatible 
with its intended use of the circuit. Any relay is eligible 
to be the middle relay of a circuit. Clients can multiplex 
multiple TCP connections (called streams) over a single 
Tor circuit; the lifetime of a circuit is generally 10 min- 
utes. Finally, Tor relays have heterogeneous bandwidths, 
and subject to the above constraints, clients select a Tor 
relay with a probability that is proportional to a relay’s 
bandwidth.¹
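The selection constraints described above can be sketched as follows. This is a simplified illustration, not Tor's path selection code: the relay records are hypothetical dictionaries, the guard is drawn uniformly from the client's three guards (an assumption made here), and the extra load-balancing for middle relays is omitted.

```python
import random


def pick_weighted(relays, rng):
    """Pick one relay with probability proportional to its bandwidth."""
    weights = [r["bandwidth"] for r in relays]
    return rng.choices(relays, weights=weights, k=1)[0]


def build_circuit(guards, relays, port, rng):
    """Choose (entry, middle, exit) for a circuit that must exit to `port`.

    guards: the client's guard relays; relays: the full relay list (global view).
    """
    entry = rng.choice(guards)               # first hop is always one of the guards
    exits = [r for r in relays
             if port in r["exit_ports"] and r["name"] != entry["name"]]
    exit_relay = pick_weighted(exits, rng)   # exit policy must allow the port
    middles = [r for r in relays
               if r["name"] not in (entry["name"], exit_relay["name"])]
    middle = pick_weighted(middles, rng)     # any other relay may be the middle hop
    return entry, middle, exit_relay
```

Note that `pick_weighted` needs the bandwidth of every candidate relay, which is exactly the global view that becomes expensive as the network grows.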


3.2 PIR 


Private information retrieval [5] provides a means of re- 
trieving a block of data out of a database of r blocks, 
without the database server learning any information 
about which block was retrieved. A trivial solution to 
the PIR problem — the one used currently by Tor — 
is to transfer the entire database from the server to the 
client, and then retrieve the block of interest from the 
downloaded database. Although the trivial solution of- 
fers perfect privacy protection, the communication over- 
head is impractical for large databases or for a system 
like Tor where minimizing bandwidth usage remains a 
high priority. PIR schemes are therefore designed to pro- 
vide sublinear communication complexity. 

¹Since not all relays are eligible for every position, some additional
load-balancing logic is used to underweight relays eligible to be guards
or exits when choosing middle relays.

We can classify PIR schemes in terms of their privacy
guarantees and the number of servers required for
the protection they provide. Information-theoretic PIR
schemes (ITPIR) are multi-server schemes that guaran- 
tee query privacy irrespective of the computational capa- 
bilities of the servers answering the user’s query. ITPIR 
schemes assume the database servers are not colluding 
to determine the user’s query. Single-server computa- 
tional PIR schemes (CPIR), on the other hand, assume 
a computationally limited database server that is unable 
to break a hard computational problem, such as the dif- 
ficulty of factoring large integers. The noncollusion re- 
quirement is then removed, at some cost to efficiency. 
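To make the multi-server idea concrete, here is a sketch of the classic two-server XOR-based scheme. It is a textbook construction used purely for illustration; it is not the Goldberg scheme or the lattice-based scheme evaluated in this paper, its query is still linear in the number of blocks, and all function names are hypothetical. Each server sees only a uniformly random bit vector, so neither learns the index unless the two collude.

```python
import secrets


def xor_blocks(blocks, size):
    """XOR together equal-length byte blocks; an empty input gives all zeros."""
    out = bytearray(size)
    for blk in blocks:
        for j, byte in enumerate(blk):
            out[j] ^= byte
    return bytes(out)


def make_queries(num_blocks, index):
    """Client: a random bit vector for server 1, and the same vector with bit `index` flipped for server 2."""
    q1 = [secrets.randbelow(2) for _ in range(num_blocks)]
    q2 = list(q1)
    q2[index] ^= 1
    return q1, q2


def answer(database, query):
    """Server: XOR of the blocks selected by the query bits."""
    size = len(database[0])
    return xor_blocks((blk for blk, bit in zip(database, query) if bit), size)


def retrieve(database, index):
    """Run the whole protocol locally; both servers hold the same database."""
    q1, q2 = make_queries(len(database), index)
    a1 = answer(database, q1)   # reply from server 1
    a2 = answer(database, q2)   # reply from server 2 (must not collude with server 1)
    return xor_blocks([a1, a2], len(a1))
```

All blocks cancel in the final XOR except the one at `index`, since it is the only block selected by exactly one of the two queries.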

We choose the single-server lattice-based scheme by 
Aguilar-Melchor et al. [1] as an example of CPIR, and 
the multi-server scheme by Goldberg [12] as an exam- 
ple of ITPIR. The CPIR scheme is the best-performing 
single-server scheme [32], and both are available as 
open-source libraries. 


4 System Overview 


4.1 Design goals 


1. Scalable architecture: We target a design for anony- 
mous communication that is able to scale the number of 
relays and clients in the network. We note that a design 
that is able to accommodate more relays in the network 
not only improves the network performance, but also im- 
proves user anonymity [9]. 


2. Security: Prior work on scalable anonymous com- 
munication only provides heuristic security guarantees, 
and the security community has been very successful at 
breaking the state-of-the-art designs. We target a design that
leverages well-understood security mechanisms making 
it relatively easy to analyze the security of the system. 
Secondly, we aim to achieve similar security properties 
as in the existing Tor network. We show that reasonable 
parameters of PIR-Tor are able to provide equivalent se- 
curity to the Tor network. 


3. Efficient circuit creation: Architectures that im- 
pose additional latency during circuit creation may not 
be practical, since the user needs to wait for the circuit 
creation to finish before starting anonymous communi- 
cation. 


4. Minimal changes: We target a design that requires 
minimal changes to the existing Tor architecture. For in- 
stance, transitioning Tor to a peer-to-peer system will re- 
quire a significant engineering effort. Our design lever- 
ages existing implementations and requires changes to 
only the directory functionality and relay selection mech- 
anism in Tor and can be incrementally deployed by both 
clients and relays. 


5. Preserving Tor constraints: The Tor network imposes
several constraints on the selection of relays during
circuit construction. For example, the first relay must be
one of the user’s guards, the final relay must allow traffic 
to exit to a user’s desired port, and the relays must be se- 
lected in proportion to their bandwidth for load balancing 
the network. Some prior work like ShadowWalker [24] 
and Salsa [29] did not focus on these issues. 


Limitations: Our architecture achieves its scalabil- 
ity properties by trading off bandwidth for computation; 
thus directory servers will be required to spend additional 
computational resources. In our performance evaluation 
we show that the computational resources required to 
support our architecture are feasible. 


4.2 System architecture 


Our key insight when designing PIR-Tor is that the 
client-server model in Tor can be preserved while si- 
multaneously improving its scalability by having users 
download the descriptors of only a few relays in the 
system, as opposed to downloading the global view. 
However, naively doing so can enable malicious direc- 
tory servers to launch fingerprinting attacks against the 
users, thereby compromising anonymity. We propose 
that users leverage private information retrieval proto- 
cols to download the identities of a few relays, thereby 
protecting their privacy against compromised directory 
servers. Note that a client does not need to use a PIR 
protocol to select its guard relays; a full download of the 
network consensus and relay descriptors suffices, since 
guard relay selection is a one-time operation that does 
not affect the scalability of the protocol. 

Recall that private information retrieval has two fla- 
vors: computational PIR and information-theoretic PIR. 
While both CPIR and ITPIR can be used by clients, the 
underlying techniques have different threat models, re- 
sulting in slightly different architectures, as depicted in 
Figure 1. 


Computational PIR at directory servers: Computa- 
tional PIR can guarantee user privacy even when there 
is a single untrusted database. In this scenario, we pro- 
pose that as in the current Tor architecture, any relay can 
act as a directory server. The directory servers maintain 
a global view of the system, and act as a PIR database. 
Clients can then use a CPIR protocol to query the direc- 
tory servers and obtain the identities of random relays in 
the system. 


Information-theoretic PIR at directory authorities 
(rejected): Information-theoretic PIR can guarantee 
user privacy only when a threshold number of databases 
do not collude. Since directory servers in the current 
Tor network are untrusted, they cannot be used as PIR 
databases. However, Tor has eight directory authorities 
sign the global system view (the network consensus).

(b) ITPIR-based architecture


Figure 1: System Architecture: For the CPIR architecture, an arbitrary set of relays are selected as the directory 
servers (PIR servers) that maintain a current copy of the PIR database, while for the ITPIR architecture, guard relays 
are the directory servers. Directory servers download the PIR database from trusted directory authorities. To perform 
a PIR query, clients first obtain meta-information about the PIR database from the directory servers, and then use the 
meta-information to select the index of the PIR block to query, taking into consideration the bandwidths and the exit 
policies of the relays. Relay information from the results of the PIR queries can be used to build circuits for anonymous 
communication. Note that in the ITPIR architecture, clients use PIR to query for only the exit relays. 


Since Tor already trusts that the majority of directory 
authorities are honest, one potential solution could have 
been to use the directory authorities as PIR databases. 
However, we reject this approach since the directory au- 
thorities would become performance bottlenecks in the 
system, in addition to targets for DDoS attacks. 


Information-theoretic PIR at guard relays: Instead, 
we note that Tor already places significant trust in guard 
nodes. If all of a client’s guard relays are compromised, 
then they can perform end-to-end timing analysis [2] in 
conjunction with selective denial of service attacks [4] to 
break user anonymity in the current Tor network. Thus 
we consider using a client’s three guard nodes as the 
servers for ITPIR. Unless all three guard nodes are com- 
promised they cannot learn the identities of the relays 
downloaded by the clients. Even if all three guard relays 
are compromised, they cannot actively manipulate ele- 
ments in the PIR database since they are signed by the 
directory authorities; they can only learn which exit re- 
lay descriptors were downloaded by the clients. (In Tor, 
guards always know the identities of the middle nodes in 
circuits through them.) If the exit relay in a circuit is hon- 
est, then guard relays cannot break user anonymity. On 
the other hand, if the exit relay used is malicious, then 
user anonymity is broken [6], but in this scenario, the ad- 
versary could have performed end-to-end timing analysis 
anyway [2] (in the current Tor network).


5 PIR-Tor Protocol Details


5.1 Database organization and formatting 


We first note that Tor relays are selected based on some 
constraints. For instance, the first relay must be an en- 
try guard, and the last relay must be an exit relay. We 
propose to organize the list of relays into three separate 
databases, corresponding to guard nodes, middle nodes 
and exit nodes. Note that some relays function as entry 
guards as well as exit relays — such relays are duplicated 
in both the guard database and the exit database. 

In addition to the last relay being an exit, its exit pol- 
icy must satisfy the client application requirements. In 
a February 2011 snapshot of the current Tor network, 
there were 471 standard exits (default exit policy) and 
482 non-standard exits sharing 221 policies. Had the 
number of non-standard exits been small, then clients in 
PIR-Tor could download all the relay descriptors for the 
non-standard exits, and use PIR to select descriptors for 
the standard exits. However, this is not the case. Instead, 
we propose that nodes in the exit database be grouped 
by their exit policies. Furthermore, in order to keep the 
number of groups manageable, we propose that there be 
a small set of standard exit policies that exit relays can 
choose from. Our architecture can accommodate a small 
set of relays with non-standard exit policies, and these 
outliers can be downloaded in their entirety as above. 

Tor relays have heterogeneous bandwidth capabilities, 
and relays with higher capacities are selected with a 
higher probability in order to load balance the network.
Bandwidth-weighted selection is straightforward given a 
global view of the network. We now outline two strate- 
gies to enable clients to perform weighted relay selection 
without this global view. The first strategy implements 
the Snader-Borisov [41,42] criterion for relay selection, 
where only the relative rank of the relays in terms of their 
bandwidths is used for relay selection.² The second strategy
is more similar to the current Tor algorithm, where
the entire bandwidth distribution of relays is taken into 
consideration for relay selection. In both scenarios, we 
first sort relays in each of the databases in order of band- 
width. Clients can use the Snader-Borisov mechanism 
by choosing the relay index to query with probability that 
depends on the index value. For example, if the relays are 
sorted in descending order of bandwidth, then clients can 
select relays having a smaller index with higher probabil- 
ity. To implement an algorithm similar to the current Tor 
network, we propose that clients download a bandwidth 
distribution synopsis from the directory servers, and use 
it to make the relay selection. Finally, we note that the 
exit database is treated as a special case since relays are 
first grouped based on their exit policies, and within each 
group, relays are further sorted by bandwidth. This en- 
ables a client to select an exit relay whose exit policy 
satisfies its application requirements in a load-balanced 
manner. 
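
The second strategy can be sketched as follows. This is only an illustration under stated assumptions: the text does not specify the format of the bandwidth distribution synopsis, so the sketch assumes it is a list with one aggregate bandwidth value per block, and the names `block_bandwidths` and `pick_block_index` are hypothetical.

```python
import bisect
import itertools
import random


def pick_block_index(block_bandwidths, rng=random):
    """Pick a block index with probability proportional to its bandwidth.

    block_bandwidths is an ASSUMED synopsis format: one aggregate
    bandwidth value per block, in database order.
    """
    cumulative = list(itertools.accumulate(block_bandwidths))
    total = cumulative[-1]
    if total <= 0:
        raise ValueError("synopsis must contain positive bandwidth")
    # Draw a point in [0, total) and locate the block whose cumulative
    # bandwidth range contains it.
    point = rng.random() * total
    index = bisect.bisect_right(cumulative, point)
    # Clamp defensively against floating-point edge cases.
    return min(index, len(block_bandwidths) - 1)
```

The client would then issue its PIR query for the returned index; the directory server never sees the index itself, only the PIR query.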

The PIR protocols we consider are block-based: the 
database is composed of a number of equal-sized blocks. 
The block size must be large enough to hold at least a sin- 
gle relay descriptor, but may hold more. We must also 
ensure that relay descriptors do not cross block bound- 
aries by padding the database. To guard against active 
attacks by directory servers, each block is signed by the 
directory authorities; the data signed also includes the 
block number (index), the consensus timestamp and a 
database identifier. To minimize overhead, we use the 
threshold BLS signature scheme [3] since signatures in 
that scheme are single group elements (22 bytes, for ex- 
ample, for 80-bit security), regardless of the number of 
directory authorities issuing signatures. 
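
A minimal sketch of the block layout described above, assuming descriptors are opaque byte strings and zero bytes are used as padding (the padding byte and the function name `pack_descriptors` are illustrative assumptions; signing each block together with its index, timestamp and database identifier is omitted):

```python
def pack_descriptors(descriptors, block_size, pad_byte=b"\x00"):
    """Pack byte-string descriptors into equal-sized blocks.

    A descriptor is never split across two blocks; the unused tail of
    each block is filled with pad_byte (an assumed padding convention).
    """
    blocks = []
    current = b""
    for desc in descriptors:
        if len(desc) > block_size:
            raise ValueError("descriptor larger than block size")
        # Start a new block if this descriptor would cross the boundary.
        if len(current) + len(desc) > block_size:
            blocks.append(current.ljust(block_size, pad_byte))
            current = b""
        current += desc
    if current:
        blocks.append(current.ljust(block_size, pad_byte))
    return blocks
```

A real encoding would also need a way for clients to find descriptor boundaries inside a block, which this sketch leaves out.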


5.2 PIR Protocols and database locations 
5.2.1 Computational PIR 


Computational PIR protocols can guarantee privacy of
user queries even with a single untrusted relay acting as
a PIR database. Thus, we can designate an arbitrary set
of relays in the network as directory servers, and only
the directory servers need to maintain a global view of
all the relays, i.e., a current copy of the network
consensus formatted as above. Then, instead of downloading
the entire consensus document from the directory
server, clients connecting to these directory servers use
a computational PIR protocol to retrieve a block of their
choice, without revealing any information about which
block to the directory server. While our architecture is
compatible with all existing CPIR protocols, we use the
lattice-based scheme proposed by Aguilar-Melchor and
Gaborit [1] since it is the computationally fastest scheme
available. Note that the lattice-based CPIR protocol is a
single-server protocol, and does not require any
interaction with other directory servers.

²The use of the Snader-Borisov criterion may have an impact on
the performance of the Tor network. Murdoch and Watson’s queueing
model [28] suggests that it will cause greater congestion at Tor relays,
whereas Snader and Borisov’s flow-level simulations [42] predict
similar or even improved network utilization.


5.2.2 Information-Theoretic PIR 


Information-theoretic PIR protocols guarantee privacy of 
user queries only if a threshold number of PIR databases 
do not collude. As stated above, we use a client’s three 
guard relays as ITPIR directory servers. The parameters 
of the protocol are set such that the guard relays do not 
learn any information about the client’s block unless all 
three of them collude. 
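
To make the "all three must collude" property concrete, here is a sketch of a much simpler XOR-based multi-server PIR in the style of Chor et al. It is not the Goldberg scheme [12] evaluated in Section 8, and all function names are hypothetical. The client splits the unit vector for its desired block into three XOR shares; any two shares are uniformly random, so the index is revealed only if all three servers pool their queries.

```python
import secrets


def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_queries(num_blocks, index, num_servers=3):
    """Split the unit vector for `index` into XOR shares, one per server."""
    shares = [[secrets.randbelow(2) for _ in range(num_blocks)]
              for _ in range(num_servers - 1)]
    last = [1 if j == index else 0 for j in range(num_blocks)]
    for share in shares:
        last = [a ^ b for a, b in zip(last, share)]
    return shares + [last]


def answer(database, query):
    """Server side: XOR together the blocks selected by the query bits."""
    result = bytes(len(database[0]))
    for block, bit in zip(database, query):
        if bit:
            result = xor_bytes(result, block)
    return result


def reconstruct(answers):
    """Client side: the XOR of all answers equals the requested block."""
    result = answers[0]
    for ans in answers[1:]:
        result = xor_bytes(result, ans)
    return result
```

In this toy scheme each query is as long as the number of blocks, so it illustrates the privacy threshold rather than the communication costs reported later.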


5.3 Client query protocol and meta-information exchange


To query for a middle and exit relay, a client connects 
to one of its directory (PIR) servers, which responds 
back with the meta-information about each of the PIR 
databases, such as the number of blocks in the database, 
the block size, the distribution of exit policies, and a 
bandwidth distribution synopsis. Note that the meta- 
information is also timestamped and signed by the di- 
rectory authorities. Based on this information, clients 
can construct a PIR query to select Tor relays while sat- 
isfying the constraints of the user. Clients can perform 
load balancing based on the Snader-Borisov mechanism 
by selecting an index to query with a probability that de- 
pends on the index value. For greater flexibility, clients 
can perform load balancing in a manner similar to the 
current Tor architecture by using the bandwidth distribu- 
tion synopsis to select an index to query. The PIR queries 
are performed by the clients well in advance of construct- 
ing the circuit, so as not to impose extra latency during 
circuit construction. Note that clients may not be able to 
predict the exit policies required by circuits in advance. 
To bypass this constraint, recall that the relays in the exit 
database are grouped based on a small set of standard 
exit policies, and clients can perform a few PIR queries 
to obtain exit relays that satisfy all standard exit poli- 
cies. Finally, clients can periodically download the relay 
descriptors of the small set of exit relays that have
non-standard exit policies (every 3 hours).

Next, we propose an optimization that clients can per- 
form while using guard relays as directory servers in the 
case of information-theoretic PIR. We note that during 
circuit creation, a guard relay learns the identity of the 
middle relay. Thus the clients could simply skip the 
PIR for the middle database, and directly query a single 
guard relay for a particular block. Note that all blocks are 
signed by the directory authorities, and any active attacks 
by the guard relay will be detected by the client. Also 
note that the fetched descriptors should only be used in 
conjunction with the guard relay from which they were 
obtained; otherwise, even a single compromised guard 
would be able to perform fingerprinting attacks [6]. 


5.4 Circuit Construction 


The circuit creation mechanism remains the same as in 
the current Tor network. In the current Tor network, 
clients construct a new circuit every 10 minutes. As we 
show in Section 8, in the ITPIR scenario, the cost of all 
Tor clients performing one PIR query (since the middle 
relay is fetched without using PIR) every 10 minutes is 
manageable. In the CPIR setting, the communication 
overhead of all Tor clients performing two PIR queries 
in a 10-minute interval is rather high, and we propose to 
perform fewer PIR queries, and reuse descriptors in sub- 
sequent time intervals. We discuss this further in Sec- 
tion 7. 


6 Traffic Analysis Resistance of PIR-Tor 


In this section we evaluate the resistance to traffic anal- 
ysis of PIR-Tor. We consider an adversary that can ob- 
serve some fraction of the network and has the ability to 
generate, modify, delete, or delay traffic. She can com- 
promise a fraction of the relays, or introduce relays of 
her own. Further, we consider that the adversary can ob- 
serve clients’ requests to the PIR-Tor directory servers, 
and knows that in these requests the client only learns 
about a fraction of the relays in the network. 

As pointed out in the past [6,7], clients’ partial knowl- 
edge of the relays belonging to the anonymity network 
enables route fingerprinting attacks. In these papers it is 
assumed that relay discovery is a non-anonymous pro- 
cess. Hence, an adversary observing the discovery pro- 
cess can build a mapping between users and the relays 
they know. If clients learn unique (disjoint) sets of relays, 
their paths can be “fingerprinted”, and the client’s iden- 
tity can be trivially recovered from this mapping. This 
problem does not exist in the current Tor, where query- 
ing the directories provides clients with a global view of 
the network. 




In PIR-Tor the threat model slightly differs from the 
one in [6,7]. Directory queries continue being identifi- 
able, but PIR prevents the adversary from learning which 
exact relays were retrieved from the database, avoiding 
the creation of a mapping describing users’ knowledge. 
Therefore, when route fingerprinting is performed the at- 
tack does not result in a direct loss of anonymity. Even 
if the choice of relays appearing in the fingerprint were 
unique, the adversary does not have a way to link this 
fingerprint to a specific client. In fact, the only way for 
the attacker to link the client with the destination of her 
traffic is to control the first and last relays in the path 
and perform a traffic confirmation attack [37], which in 
our system will happen with probability $c^2$, where c is
the fraction of compromised bandwidth in the network 
— the same probability as in the current Tor network. 

Although route fingerprinting does not result in a di- 
rect loss of anonymity in PIR-Tor, the information leaked 
could be used by the adversary to relate connections from 
the same user and construct behavioral profiles. In turn, 
these profiles can lead to the re-identification of users 
directly [16] or by combining them with publicly avail- 
able databases [14, 30,43]. We note that the linkabil- 
ity of circuits is not a problem unique to PIR-Tor, and 
that features other than partitioning the network (e.g., 
cookies [36], session timing [17], or frequently accessed 
hosts [17]) can be used in the current Tor network to pro- 
file users. 


6.1 Impact of fingerprinting on PIR-Tor 


Before diving into the analysis we note that the number 
of relays (or descriptors) in each PIR block is irrelevant 
for the result. Fingerprinting attacks are based on the 
clients’ knowledge of relays in the network, but in PIR- 
Tor clients retrieve blocks that may contain one or more 
descriptors. Hence, either the client knows about all the 
descriptors in a block or she does not know any of them. 
Thus, from the point of view of the adversary all relays in 
a block are equivalent, regardless of how many descrip- 
tors are in this block; only the number of blocks matters 
when computing the probabilities we use in our analysis. 

We consider an adversary that controls the receiver of 
the communication, and thus can observe the exit relay 
chosen by the client. Additionally, she may also control 
the exit relay hence also learning the middle relay in the 
client’s circuit. 


6.1.1 One PIR request per circuit construction 


If the computation and communication cost for clients 
and directory servers in dealing with PIR queries is small 
(as when ITPIR is used), clients could request new de- 
scriptors for each circuit construction. Regardless of 
the selection algorithm used, due to the PIR properties,
the adversary cannot distinguish which block is retrieved 
from the database with each query and hence she gains 
no information as to which relays are known to the client. 
In this setting the adversary must assume that all relays 
are known to the client, and PIR-Tor fingerprinting resis- 
tance is equivalent to that of the current Tor network. 

Nevertheless, when CPIR is used we must expect lim- 
itations both in bandwidth and computation capabilities. 
Therefore, each time the client obtains a set of descrip- 
tors with a CPIR query, these descriptors may have to be 
reused across multiple circuit rebuilds. In the next sec- 
tion we evaluate the impact of this reuse on the privacy 
protection offered by PIR-Tor. 


6.1.2 Reusing descriptors for circuit construction 


In our analysis we assume that the attacker observes 
the exit relay (respectively exit and middle relays) of a 
client’s circuit. As we have already discussed, this does 
not directly leak information about the client’s identity 
and anonymity is preserved. However, the adversary can 
still profile clients based on their network knowledge, 
eventually leading to de-anonymization [14, 16, 30, 43]. 

The adversary can construct a behavioral profile with 
all connections she observes coming from exit relays (re- 
spectively exit and middle relays) that belong to the same 
PIR block. If the selection algorithm is such that many 
clients have knowledge of a block (recall that all relays 
in the block are equivalent for the attacker) the profile 
recovered by the adversary is an aggregate profile of all 
these users, which hinders the de-anonymization of individual
clients. On the other hand, if the choice of relays
is unique to each client the profile recovered by the ad- 
versary accurately reflects the behavior of an individual 
user and the danger of de-anonymization grows. There- 
fore, it is desirable that clients share choices such that 
the adversary can only obtain aggregated profiles that 
reduce her precision when re-identifying clients. Other 
ways than relay selection for the attacker to link and/or 
discriminate clients’ connections [17,36] are left out of 
the scope of our analysis. 

In this section, we evaluate the protection against pro- 
filing provided by PIR-Tor when descriptors have to be 
reused across circuit constructions. We aim to answer 
the question “how precisely can the adversary assign an 
observed connection (exit relay, or exit and middle) to 
a unique client?”. We use as a metric the fraction $\alpha$ of
clients that could be initiators of a connection (i.e., the 
expected fraction of clients that have knowledge of the 
PIR-Tor block containing the relay(s) observed by the 
adversary). The larger the fraction of clients that may 
know the observed relay, the better privacy users enjoy 
because the adversary can only construct aggregate profiles.
We note that even if the adversary is actually collecting
information from a single user, she cannot be sure
that this is the case based on the PIR-Tor relay selection 
algorithm; she must assume that the profile she observes 
may contain sessions from multiple users. We also note 
that, based on the relay selection algorithm, the adver- 
sary cannot link connections from a user routed through 
different exit relays. This is because the PIR properties 
prevent the attacker from learning any relation between 
the descriptors retrieved by a client. Hence, the connec- 
tions of one client routed through exit relays in different 
PIR blocks are unlinkable and the adversary must assign 
them to different profiles (that may or may not contain 
information about other users). 

If the adversary observes connections coming from the
exit relay e, the clients that may know this relay are
those who retrieved the block containing e from the
database. In PIR-Tor we assume that clients retrieve
a set B of b blocks every time they query the directory
server; hence the fraction $\alpha$ of clients that have
knowledge of the block containing e is
$\alpha = 1 - (1 - \Pr[e])^b$,
where $\Pr[e]$ is the probability that a given one of the b
retrieved blocks is the block containing e, and depends
on the algorithm used for the selection of relays. For
simplicity in our analysis we assume that there is only a
single standard exit policy.
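
The expression for the fraction of clients that know a block is easy to evaluate numerically. The sketch below is illustrative only; the function names are hypothetical, and it assumes the b retrievals are independent draws with the same per-block probability.

```python
def fraction_knowing_block(pr_block, b):
    """Fraction of clients knowing a block: 1 - (1 - Pr[e])**b.

    Assumes the b retrieved blocks are chosen independently, each
    hitting the block of interest with probability pr_block.
    """
    if not 0.0 <= pr_block <= 1.0:
        raise ValueError("pr_block must be a probability")
    if b < 0:
        raise ValueError("b must be non-negative")
    return 1.0 - (1.0 - pr_block) ** b


def expected_clients(pr_block, b, num_clients):
    """Expected number of clients whose profile is aggregated."""
    return fraction_knowing_block(pr_block, b) * num_clients
```

For example, under a hypothetical uniform choice among 13 blocks with b = 1, `fraction_knowing_block(1 / 13, 1)` is 1/13, so roughly one client in thirteen would share knowledge of any given block.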

We explained in Section 5 that for load balancing, re- 
lays with higher capacities are selected with a higher 
probability. We described two criteria for selecting re- 
lays: a bandwidth-based criterion (BW), and the Snader- 
Borisov criterion (SB). To evaluate the BW criterion ac- 
cording to a realistic bandwidth distribution we captured 
a snapshot of the Tor consensus directory on 9 February 
2011. This directory includes 649 exit relay descriptors 
after removing the slowest one-eighth of the total relays 
that are not used to relay traffic at all in the current Tor 
network [31]. For the evaluation of SB we computed the 
probability Pr[e] according to the algorithm introduced
in [41]. Given the function
$f_s(x) = \frac{1 - 2^{sx}}{1 - 2^s}$, a value x
is drawn uniformly at random from [0, 1), and the block
with index $\lfloor N_{blocks} \times f_s(x) \rfloor$ is selected.
The inverse of the function $f_s(x)$ is the function
$f_s^{-1}(x) = (\log_2(1 - (1 - 2^s) \cdot x))/s$.
Then, the probability of selecting a block containing the
relay e in the i-th position of the list is
$\Pr[e] = f_s^{-1}(i/N_{blocks}) - f_s^{-1}((i - 1)/N_{blocks})$.
We use s = 1, which results in a probability distribution
close to uniform, and s = 10, which results in a distribution
heavily skewed towards the relays offering high bandwidth.
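
The Snader-Borisov selection probabilities can be computed directly from these definitions. The following sketch assumes s > 0 and uses hypothetical function names; it is meant as a sanity check of the formulas rather than as the authors' implementation.

```python
import math


def f_s(x, s):
    """Snader-Borisov function f_s(x) = (1 - 2**(s*x)) / (1 - 2**s), s > 0."""
    return (1.0 - 2.0 ** (s * x)) / (1.0 - 2.0 ** s)


def f_s_inv(y, s):
    """Inverse of f_s on [0, 1]: log2(1 - (1 - 2**s) * y) / s."""
    return math.log2(1.0 - (1.0 - 2.0 ** s) * y) / s


def block_probabilities(n_blocks, s):
    """Selection probability of the block in position i = 1..n_blocks."""
    return [f_s_inv(i / n_blocks, s) - f_s_inv((i - 1) / n_blocks, s)
            for i in range(1, n_blocks + 1)]


def select_block(x, n_blocks, s):
    """Map a uniform draw x in [0, 1) to a 0-based block index."""
    return min(int(n_blocks * f_s(x, s)), n_blocks - 1)
```

With 13 blocks, `block_probabilities(13, 10)[0]` evaluates to roughly 0.63, which appears consistent with the SB(10) outlier noted for Figure 2, while for s = 1 the largest and smallest block probabilities differ by less than a factor of two.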


Figure 2 shows box plots³ describing the distribution
of $\alpha$ for the different selection algorithms. We choose
two database sizes to show the performance of the
algorithms when the network scales. The first is a small
database that contains 649 exit relays (as in the current
Tor network) divided in 13 blocks for optimal performance
of the CPIR algorithm (note that if ITPIR is used there is
no need to reuse relays across circuit rebuilds).⁴ The
second database contains 1M relays, divided in 148 blocks.
In the BW case, we construct the distribution of
bandwidth amongst the relays by concatenating copies of the
original list downloaded from the Tor network. We refer
the reader to the extended version of this paper [26] for
a more detailed analysis of the evolution of $\alpha$ when the
network scales.

Figure 2: BW and SB(s) selection: evolution of $\alpha$ with
the database size.

The median of the BW distribution is $\alpha = 0.016$; that
is, 1600 clients have knowledge of the median relay when the
network is used by 100 000 users.⁵ When the network
grows, the median of $\alpha$ diminishes to 0.0012. As there
are more blocks in the database, clients have more choice,
and so they share knowledge of fewer relays.

We can see that SB(1) offers the best protection (a
larger fraction of clients know a relay), but as clients’
choice of relays, and hence blocks, is nearly uniform, it
does not load balance the network. The means of SB(1)
and SB(10) are similar; however, SB(10) has a greater
variance. SB(10) yields medians of $\alpha = 0.022$ and
$\alpha = 0.0019$ when there are 13 and 148 blocks in the
database, respectively. In the latter case, the adversary
still captures aggregated profiles of 190 clients for the
median relay.

³The line in the middle of the box represents the median of the
distribution of $\alpha$. The lower and upper limits of the box correspond,
respectively, to the first (Q1) and third quartiles (Q3) of the distribution.
We also show the outliers: relays e which are chosen with values that
are “far” from the rest of the distribution ($\alpha > Q3 + 1.5(Q3 - Q1)$
or $\alpha < Q1 - 1.5(Q3 - Q1)$).
We have cut the figure’s y axis for better visibility. The figure does
not show two outliers for the 13-block BW and SB(10) plots that have
$\alpha = 0.38$ and $\alpha = 0.63$, respectively.

⁴For a database of r 2100-byte descriptors and recursion parameter
R = 2 (see Section 7), the optimal number of blocks is approximately
$1.50 \cdot \sqrt[3]{r}$.

⁵The Tor project reports an estimate of between 100 000 and 250 000
Tor users in January 2011 (http://metrics.torproject.org/users.html).
We take 100 000 to represent the worst-case scenario for the clients.

If, besides the receiver of the communication, the
adversary also controls the exit relay, then she can observe
the middle and exit relays of the client’s path. Let us call
the observed exit relay e, and the observed middle relay
m. The fraction of clients knowing the blocks containing
these relays is
$\Pr[e, m \in B] = (1 - (1 - \Pr[e])^b) \cdot (1 - (1 - \Pr[m])^b)$,
where $\Pr[e]$ and $\Pr[m]$ depend on the
path selection algorithm. Hence, $\Pr[e, m \in B]$ is orders
of magnitude smaller than $\Pr[e \in B]$, increasing the
accuracy of profiling, as it becomes less likely that clients
share knowledge of both exit and middle relays.
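
To give a feel for the size of this effect, the following worked example plugs in purely illustrative values (assumptions chosen for the arithmetic, not measurements): $\Pr[e] = \Pr[m] = 0.016$, b = 1, and 100 000 clients.

```latex
\begin{align*}
\Pr[e \in B] &= 1 - (1 - 0.016)^1 = 0.016
  && \text{(about 1600 of 100\,000 clients)} \\
\Pr[e, m \in B] &= 0.016 \cdot 0.016 = 2.56 \times 10^{-4}
  && \text{(about 26 of 100\,000 clients)}
\end{align*}
```

Under these assumed values, observing the middle relay as well shrinks the aggregate profile by nearly two orders of magnitude.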

We note that the results above represent the case in
which clients only retrieve b = 1 block per PIR query.
If the clients retrieve more blocks they can significantly
improve their privacy protection ($\alpha$ grows approximately
linearly with b). Moreover, if clients retrieve b > 1
blocks each time, they divide by b the number of
circuits routed by each of the known exit relays. Finally,
we would like to stress that clients’ profiles are only
linkable until they refresh their network knowledge. If, as in
the current Tor network, this happens every 3 hours and
circuits are rebuilt every 10 minutes, the adversary can
link data from only 18/b circuits. We have shown in this
section that, even though it does not break anonymity,
reusing descriptors breaks the unlinkability of circuits.
In order to prevent the attack we have discussed, clients
should request new blocks from the directory server (or
from the guard nodes if ITPIR is used) often, or in groups
of several blocks, such that the reuse of descriptors is
minimized.
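
The circuit count in this argument is simple arithmetic, sketched below under the assumption that a client spreads its circuits evenly over the b blocks it retrieved (the function name and the even-spread assumption are illustrative):

```python
def linkable_circuits(refresh_minutes=180, circuit_minutes=10, b=1):
    """Circuits per retrieved block between two descriptor refreshes.

    Assumes circuits are spread evenly over the b retrieved blocks.
    """
    if b < 1:
        raise ValueError("b must be at least 1")
    circuits_per_refresh = refresh_minutes // circuit_minutes
    return circuits_per_refresh / b
```

With the default 3-hour refresh and 10-minute circuit lifetime this gives 18 circuits for b = 1 and 6 circuits per block for b = 3.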


7 Performance Evaluation of Computational PIR


We now present experimental results for the CPIR archi- 
tecture. We chose standard security parameters for the 
CPIR scheme [1] ($l_0 = 19$ and N = 50), and computed
the client/server computation times and communication 
costs by running an implementation of this scheme [15]. 
The hardware was a dual Intel Xeon E5420 2.50 GHz 
quad-core machine running Ubuntu Linux 10.04.1. Note 
that for our evaluation, we used only a single core, which 
is equivalent to a standard desktop machine today. 

We set the descriptor size to be 2 100 bytes (the maxi- 
mum descriptor size measured from the current Tor net- 
work), and set the exit database to be half the size of the 
middle database [45]. We varied the number of relays 
in a PIR database, and computed a) PIR server computa- 
tion, b) total communication, and c) client computation. 

Data transfer for CPIR schemes can be reduced using
the recursive construction by Kushilevitz and Ostrovsky [21]
without much increase in computational cost;
this recursion can be implemented in a single round of
interaction between the client and the server. We denote the
recursion parameter in CPIR using R. If we denote the
number of relays in the database by n, then the
communication cost of CPIR in our architecture is proportional
to $8^R \cdot n^{1/(R+1)}$.

Figure 3: CPIR cost. R denotes the recursion parameter in CPIR.


Figure 3 depicts the server computation, communication,
and client computation as a function of the number
of Tor relays for varying values of the recursion parameter
R. Increasing R reduces communication (and client
computation) drastically while having only a small impact
on server computation. Note that beyond R = 3,
communication increases again, because the term $8^R$ in
the communication overhead becomes dominant. We can
see that when the number of relays is less than 20 000, the
server computational overhead using R = 2 is smaller
than with R = 3, while the communication overhead using
R = 2 and R = 3 is about the same. Beyond 20 000
relays, using R = 3 results in significant communication
savings as compared to R = 2, while the server
computational overhead is about the same for both parameters.
For the remainder of this discussion, we use R = 2. We
can see that as the network size scales, the communication
overhead of CPIR is an order of magnitude smaller
than trivial download of the database. Interestingly, even
at the current network size, the communication overhead
of CPIR is smaller than a trivial download.


Now we discuss the issue of creating multiple cir- 
cuits within a 3-hour interval (after which the directory 
databases are refreshed and clients request new descrip- 
tors). In this scenario, the trivial download has the ad- 
vantage that any number of circuits can be created. Tor 
clients rebuild a circuit after every 10 minutes, so they 
could create 18 circuits every 3 hours with the commu- 
nication overhead of a single trivial download. On the 
other hand, the PIR-based architecture would require 18 
PIR queries for middle nodes and another 18 for exit 
nodes. We can see that unless the number of relays in the 
database is greater than 40 000, trivial download is going
to be more efficient than performing multiple CPIR
queries. Instead, we propose to perform b < 18 queries 
for both middle and exit nodes, and reuse existing blocks 
for more circuits. As we discuss in the security analysis, 
reusing blocks does not affect the anonymity of a single 
circuit, but may break the unlinkability of multiple cir- 
cuits. 


We now study some particular scaling scenarios in 
more detail. For each of the following scenarios, we 
will compute the number of cores required to support 
the clients. Figure 4 depicts the required number of 
CPU cores as a function of relays and clients. We also 
study the communication overhead of CPIR-Tor, along 
with a comparative analysis with the current Tor proto- 
col. For this analysis, we set the number of blocks b = 1. 
Note that both computation and communication over- 
head for CPIR-Tor scale linearly with the desired number 
of blocks. Our results are summarized in Table 1. 


Scenario 1: Current Tor Size. Total number of
simultaneous relays is 2000. Total number of simultaneous
clients is 250 000. For 2000 relays, server compute
time is 0.2 seconds. The number of exit nodes is
around 1 000, and the corresponding server compute time
is 0.1 seconds. Thus to download a block from both the
middle and the exit databases, the total server compute
time is 0.3 seconds. Note that we are proposing to
download a block every 3 hours. A single directory server
would thus be able to support 36 000 clients
($\frac{3 \cdot 60 \cdot 60}{0.3}$).
The total number of cores required to support 250 000
clients is only 7. As of February 2011, the size of the
Tor network consensus is 560 KB, while the total size
of the relay descriptors is about 3.3 MB. Thus the
communication overhead per client in the current Tor
network is about 1.1 MB every 3 hours (560 KB consensus
and $\frac{3300}{6}$ KB relay descriptors), while the corresponding
overhead in our architecture is 2 MB. Thus, CPIR-Tor is
not suited for the current Tor network size.
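
The core count in Scenario 1 follows from a short capacity calculation, sketched here with hypothetical function names. It assumes each client costs a fixed amount of server CPU time per interval and that load is spread evenly over the interval.

```python
import math

THREE_HOURS = 3 * 60 * 60  # seconds between descriptor refreshes


def clients_per_core(compute_seconds_per_client,
                     interval_seconds=THREE_HOURS):
    """Clients one core can serve if each needs the given CPU time."""
    return interval_seconds / compute_seconds_per_client


def cores_required(num_clients, compute_seconds_per_client,
                   interval_seconds=THREE_HOURS):
    """Whole cores needed to serve all clients once per interval."""
    total_cpu_seconds = num_clients * compute_seconds_per_client
    return math.ceil(total_cpu_seconds / interval_seconds)
```

With 0.3 seconds of server compute per client (0.2 for the middle database plus 0.1 for the exit database), one core handles about 36 000 clients per 3-hour interval, and `cores_required(250_000, 0.3)` gives the 7 cores quoted above.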


Scenario 2: Increasing clients. Total number of
relays is fixed at 2 000. Total number of clients increases
by a factor of $s_c$. The number of cores required to
support $s_c \cdot 250\,000$ clients is $s_c \cdot 7$ (linear increase). Thus if
the number of clients increases to 2.5 million, about 70
cores will be required to support the architecture. Both
the number of cores and the communication overhead of
the system increase linearly with the number of clients.


Table 1: Summary of results: Comparison of overhead in Tor, CPIR and ITPIR. The communication overhead is
measured per client over a 3 hour interval.

Scenario  Relays    Clients     Tor (MB)  CPIR (MB / Cores)  ITPIR (MB / % Core Utilization per guard)
1         2 000     250 000     1.1       2 / 7              0.2 / 0.425%
2         2 000     2 500 000   1.1       2 / 70             0.2 / 4.25%
3         20 000    250 000     11        4 / 59             0.5 / 0.425%
4         20 000    2 500 000   11        4 / 553            0.5 / 4.25%
5         250 000   250 000     111       8 / 466            0.2 / 0.425%


Scenario 3: Increasing relays. Total number of
relays increases by a factor of $s_r$. Total number of
clients is fixed. The number of cores required to support
$s_r \cdot 2000$ relays increases sublinearly with $s_r$. For
example, when the number of relays increases from 2 000 to
20 000, the required number of cores increases from 7 to
59. Note that in this scenario, the communication
overhead for CPIR-Tor also scales sublinearly, while that of
current Tor scales linearly. Thus, as the number of relays
increases, it becomes more and more advantageous to
use CPIR-Tor. For instance, when the number of relays
is 20 000, the communication overhead of Tor is 11 MB
every 3 hours, while that of CPIR is only 4 MB.


Scenario 4: Increasing both clients and relays.
Total number of relays and clients increases by a factor
of s. The number of cores required to support $s \cdot 2000$
relays and $s \cdot 250\,000$ clients is strictly less than $7 \cdot s^2$.
In order to support 20 000 relays and 2.5 million clients,
553 cores would be required. We note that approximately
50% of the Tor relays are already directory servers, so
553 cores in this scenario is feasible. Again, as the
number of relays increases, the advantage of CPIR-Tor over
Tor becomes larger.


Scenario 5: Converting clients to middle-only re- 
lays. Observe that if all 250000 clients converted to 
middle-only relays, then the server compute time for the 
middle database is 20 seconds, while that for the exit 
database is still 0.1 seconds. Thus, the total number of 
cores required to support this scenario is approximately 
466. (This scenario is not shown in Figure 4.) As com- 
pared to the current Tor network, CPIR reduces the com- 
munication overhead in the network from 111 MB per 
client every 3 hours to only 8 MB. 
Figure 4: Number of cores as a function of the number of
relays and clients (assuming half of the relays are exits).


8 Performance Evaluation of Information-Theoretic PIR


We use an implementation [13] of the multi-server PIR 
scheme by Goldberg [12] and compute the server compu- 
tation, total communication, and client computation, for 
varying values of the number of relays, using a descriptor 
size of 2 100 bytes, and 3 servers. 

Figure 5 plots server computation, total communica- 
tion, and client computation as a function of the number 
of Tor relays, using 3 PIR servers (the entry guards). We 
note that the communication cost for a single ITPIR re- 
quest is at least 2 orders of magnitude smaller than the 
cost for a trivial download for all possible scaling sce- 
narios. 

Even when we compare ITPIR-Tor with Tor over a period
of 3 hours, during which clients set up 18 circuits, the
communication overhead of ITPIR is still an order of
magnitude smaller than a full download for all scaling
scenarios. Thus, in this architecture clients do not need
to reuse blocks, which provides security equivalent to that
of Tor as long as at least one guard relay is honest.
Recall that if all guard relays are compromised, then the
adversary can break user anonymity in both the current
Tor network as well as in PIR-Tor, by selectively denying
service [4] to circuits that have an honest exit relay
(or destination server) and performing end-to-end timing
analysis [2] when the exit relay (or destination server) is
compromised.

Figure 5: 3-server ITPIR cost.


We now explore various scaling scenarios for Tor, and 
compute the number of clients that each guard relay can 
support, along with a comparison of the communication 
cost to that of Tor. Our results are summarized in Table 1. 


Scenario 1: Current Tor Size. Total number of re- 
lays is 2000. Total number of clients is 250000. For 
2000 relays, the number of exit nodes is around 1 000, 
and the corresponding server compute time is 0.005 sec- 
onds. Thus to support a single circuit, the total server 
compute time is 0.005 seconds (for all three guards com- 
bined). Note that each client builds a circuit every 10 
minutes. A single guard relay would thus be able to sup- 
port 360000 clients. In the current Tor network, there 
are 250000 clients, and approximately 500 guard relays, 
so each guard relay needs to service only 1500 clients on 
average, and would utilize only 0.425% of one core. The 
communication overhead of Tor is 1.1 MB per client ev- 
ery 3 hours. In ITPIR, the cost to build a single circuit is 
only 12 KB. Even if clients build 18 circuits over a three 
hour interval, the total communication cost of all 18 cir- 
cuits is 216 KB. Thus ITPIR is useful even with the size 
of the current Tor network. 
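
The Scenario 1 figures can be reproduced with a few lines of arithmetic. The Python sketch below uses only numbers quoted above; the variable names are ours, and the even split of the 0.005-second compute time across a client's three guards is our reading of the text rather than something the evaluation states explicitly. With these inputs the utilization comes out near 0.42%, close to the 0.425% quoted above.

```python
# Back-of-the-envelope check of Scenario 1 (all inputs are quoted in the text).
CIRCUIT_INTERVAL_S = 10 * 60   # each client builds a circuit every 10 minutes
TOTAL_COMPUTE_S = 0.005        # server compute per circuit, all three guards combined
GUARDS_PER_CLIENT = 3
NUM_CLIENTS = 250_000
NUM_GUARDS = 500

# Assumption: the 0.005 s is shared equally by the three guards.
per_guard_compute_s = TOTAL_COMPUTE_S / GUARDS_PER_CLIENT

# Clients that one fully loaded core could serve in a 10-minute window.
capacity = CIRCUIT_INTERVAL_S / per_guard_compute_s

# Average number of clients each guard actually serves.
clients_per_guard = NUM_CLIENTS * GUARDS_PER_CLIENT / NUM_GUARDS

# Fraction of one core used by an average guard, in percent.
utilization_pct = 100 * clients_per_guard / capacity

# ITPIR communication over a 3-hour window: 18 circuits at 12 KB each.
itpir_kb = 18 * 12

print(round(capacity), clients_per_guard, round(utilization_pct, 3), itpir_kb)
```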


Scenario 2: Increasing clients. Total number of relays
is fixed at 2 000. Total number of clients increases
by a factor of $s_1$. In order to support $s_1 \cdot 250000$ clients,
guard relays would need to utilize $s_1 \cdot 0.425\%$ of a core.
Thus even when the number of clients increases to 2.5 
million, but the number of guard relays stays fixed at 500, 
then each guard relay only utilizes a 4.25% fraction of a 
core. The total communication overhead in the system 
increases linearly with the number of clients, similar to 
the current Tor network. 


Scenario 3: Increasing relays. Total number of
relays increases by a factor of $s_2$. Total number of
clients is fixed. In order to support $s_2 \cdot 2000$ relays,
guard relays would need to utilize only 0.425% of a core.
This is because the size increase in the PIR database 
is offset by the increase in the number of guard relays. 
Thus, regardless of the number of relays in the system, 
each guard relay utilizes only 0.425% of a core. Also, as 
the number of relays increases, the advantage of ITPIR 
over a full download in terms of communication cost also 
increases. For instance, at 2 000 relays ITPIR is a factor 
of 5 more efficient than Tor, while at 20000 relays, IT- 
PIR is a factor of 22 more efficient than Tor (516 KB per 
client every 3 hours as compared to 11.1 MB in Tor). 


Scenario 4: Increasing both clients and relays. Total
number of relays and clients increases by a factor
of s. In order to support $s \cdot 250000$ clients and
$s \cdot 2000$ relays, each guard relay would need to utilize
$s \cdot 0.425\%$ of a core. Thus when the number of clients
is 2.5 million, and the number of relays is 20 000, each 
guard relay utilizes 4.25% of a core. Even at 100 times 
the current client base (25 million), 42% of one core is re- 
quired, which may be reasonable in multi-core settings. 
As the number of clients increases, the communication 
overhead in both ITPIR and Tor increases linearly, while 
as the number of relays increases, it becomes a lot more 
advantageous to use ITPIR as compared to Tor. 
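
The scaling behaviour in Scenarios 2 to 4 follows from one cancellation: per-query work grows with the number of relays, but so does the guard pool, so only the client factor survives. The sketch below is our own simplified model of that argument (the function name and the proportionality assumptions are ours); it takes the 0.425% baseline from the text as given.

```python
BASELINE_UTILIZATION_PCT = 0.425  # per-guard core utilization at current Tor size (from the text)


def guard_utilization_pct(client_factor: float, relay_factor: float) -> float:
    """Per-guard core utilization when clients and relays scale independently.

    Assumed model: the PIR database, and hence the work per query, grows
    linearly with relay_factor, while the number of clients served by each
    guard scales as client_factor / relay_factor because the number of guards
    grows with the number of relays.
    """
    per_query_work = relay_factor
    clients_per_guard = client_factor / relay_factor
    return BASELINE_UTILIZATION_PCT * per_query_work * clients_per_guard


print(guard_utilization_pct(10, 1))     # Scenario 2: ten times the clients
print(guard_utilization_pct(1, 10))     # Scenario 3: ten times the relays
print(guard_utilization_pct(10, 10))    # Scenario 4: both scaled by ten
print(guard_utilization_pct(100, 100))  # Scenario 4 at 25 million clients
```

Under this model the relay factor never appears in the result, which matches the observation that utilization stays at 0.425% in Scenario 3.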


Scenario 5: Converting clients to middle-only re- 
lays. Observe that if all 250000 clients converted to 
middle-only relays, then the server compute time for the 
guard relays remains unchanged, since PIR is not per- 
formed over the middle database. Thus each guard relay 
would still utilize only 0.425% of a core. 


To further highlight the scalability of ITPIR, we also 
consider a scenario where all 250000 clients convert to 
relays, with a similar distribution of guard/middle/exit 
relays as in the current Tor network. The communication 
overhead of ITPIR in this scenario is 1.7 MB per client 
every 3 hours, while that of Tor is 137 MB — two orders 
of magnitude higher. 

9 Discussion


We now discuss some issues in, and ramifications of, our 
design. 


Comparison of CPIR vs. ITPIR. The CPIR-Tor ar- 
chitecture does not require all guard relays to be direc- 
tory servers, and is more easily integrated into the current 
Tor network, where a random subset of the relays are di- 
rectory servers. Moreover, it is ideal for the scenarios 
where either a client’s browsing time is small (possibly 
estimated using the client’s past Tor browsing history), 
or the client is not interested in the unlinkability of its 
connections. On the other hand, the ITPIR-Tor architec- 
ture requires all guard relays to be directory servers, thus 
requiring them to maintain a global view of the system, 
but results in significant communication savings for the 
clients. The ITPIR-Tor architecture can support a vari- 
ety of client workloads, while providing a high level of 
security. In particular, ITPIR-Tor can enable a very at- 
tractive scenario where all clients become middle-only 
relays, without any additional cost to the network, since 
the middle relays are fetched for free (without doing PIR) 
by the clients. 


Robustness. Recall that each block of the descriptor 
database is digitally signed by the trusted directory au- 
thorities. These signatures prevent malicious PIR servers 
from tricking clients into accepting false information. 
However, such malicious servers could still deny service 
to clients by returning garbage, or by not returning a re- 
sponse at all. As we discuss next, in both CPIR-Tor and 
ITPIR-Tor clients can easily detect this attack and can 
stop using those malicious servers. 

In CPIR-Tor, a malicious directory server could mod- 
ify its own copy of the descriptor database in order to cor- 
rupt blocks containing, for example, many honest nodes, 
and leave with correct signatures those blocks containing 
collaborating malicious nodes. Clients retrieving these 
“malicious blocks” will be successful, but clients retriev- 
ing “honest blocks” will not. In order to defend against 
this, a CPIR-Tor client that receives even one corrupted 
block (out of b requests) from a given (Byzantine) direc- 
tory server should discard the entire response, and make 
a new, freshly randomly chosen query for all b blocks 
from a different server. It should also avoid using that 
Byzantine server in the future. 
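
The client-side defense just described can be sketched as a retry loop. This is a minimal illustration, not the authors' implementation: `query` is a hypothetical stand-in for a complete PIR retrieval (the block index is passed in the clear only to keep the sketch short), and `signature_ok` stands in for verifying the directory authorities' signature on a block.

```python
import random
from typing import Callable, List, Sequence, Set


def fetch_blocks(
    servers: Sequence[str],
    num_blocks: int,
    b: int,
    query: Callable[[str, int], bytes],
    signature_ok: Callable[[bytes], bool],
    blacklist: Set[str],
    rng: random.Random,
) -> List[bytes]:
    """Fetch b signed blocks, discarding a server's whole response on any bad block."""
    for server in servers:
        if server in blacklist:
            continue
        # Fresh random choice of block indices for every attempt.
        indices = rng.sample(range(num_blocks), b)
        blocks = [query(server, i) for i in indices]
        if all(signature_ok(block) for block in blocks):
            return blocks
        # Even one corrupted block: drop everything and avoid this server in future.
        blacklist.add(server)
    raise RuntimeError("no directory server returned a fully valid response")
```

Discarding the whole response, rather than keeping the blocks that verified, is what prevents a selective-corruption server from steering the client towards blocks of its choosing.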

In ITPIR-Tor, on the other hand, such a selective- 
corruption attack is not possible unless all three guard 
nodes are colluding. In the ITPIR-Tor setting, Byzantine 
guard nodes can corrupt the result of the query, but not in 
a way that depends on which block was requested. Un- 
fortunately, with ITPIR-Tor as presented, although the 
client will detect the corruption, it will not learn which 
of the guard nodes was Byzantine. This can be rectified,
however, using the Byzantine robustness techniques of 
the underlying ITPIR protocol [12]. In particular, a client 
receiving blocks with correct signatures may safely use 
those blocks. If there are corrupted blocks, the client can
identify which guard node(s) were Byzantine and caused
the corruption by extending the queries for just the
corrupted blocks to additional guard nodes. When three
honest guard nodes are reached, even though the client 
does not know a priori which are the honest ones, the 
Byzantine nodes will be identified. However, this may 
come at the cost of the Byzantine nodes (if there are at 
least three) learning which exit block the client was inter- 
ested in. Therefore, the client should not use the resulting 
information to build circuits; it should only use it to learn 
which nodes were Byzantine and thus should be avoided 
in the future. 


Additional scaling strategies. The Tor Project has 
been actively working on improving its scaling proper- 
ties. We now discuss some strategies under consideration 
that may be implemented in the future. The first strategy 
is to download relay descriptors on demand [34] during 
the circuit construction process, as opposed to periodi- 
cally fetching them in advance. Fetching descriptors on 
demand would significantly reduce the communication 
overhead in Tor. However, note that fetching descriptors 
on demand does not satisfy our goal of efficient circuit 
creation, since descriptor downloads increase circuit cre- 
ation times. 

The second strategy introduces the idea of microde- 
scriptors [8], which contain all relay descriptor fields that 
rarely change. All frequently changing fields are placed 
in the network consensus. Clients download the network 
consensus document frequently, but the microdescriptors 
are cached on a long-term basis. We note that this pro- 
posal is orthogonal to our architecture, and can be incor- 
porated in the PIR-Tor protocol. In this case, the PIR 
database would consist of only the network consensus 
information. The size reduction in the PIR database be- 
cause of the removal of microdescriptors would translate 
into both computational and communication savings in 
our architecture. 


Computational puzzles to prevent DoS. In our archi- 
tecture, directory servers act as PIR databases and per- 
form computation to respond to user queries. This pro- 
vides an opportunity to the attacker to launch a denial 
of service (DoS) attack against the directory servers by 
issuing multiple PIR queries. We propose to use com- 
putational puzzles to mitigate the impact of this attack. 
When a directory server begins to get computationally 
congested, it starts to issue computational puzzles to 
clients. Clients solve the computational puzzle and return
the solution to the directory server. The directory
server verifies the puzzle solution, and only then starts 
to spend computational resources to process the client’s 
PIR query. 


Impact of churn. In the current Tor network, as the 
churn in the network increases, clients will have to down- 
load the full list of network consensus and relay descrip- 
tors more frequently. On the other hand, the impact of 
churn on PIR-Tor is minimal, since only a small number 
of directory servers or guards will need to download the 
global view more frequently. In fact, as long as the interval
between database updates is longer than 10 minutes (it is
currently set to 3 hours), we can expect the number of client
PIR queries to be the same.


Impact of number of circuits. The communication 
overhead of PIR-Tor is directly proportional to the num- 
ber of circuit constructions, since for optimal security, 
clients need to perform 1 or 2 PIR queries per circuit. 
Tor developers are already working on a proposal to have 
a separate circuit for each application, to prevent certain 
kinds of profiling [18]. In this scenario, since there is 
a separate circuit per application, the timeout period for 
each circuit can be increased from the current value of 
10 minutes, to keep the impact of additional circuits on 
our architecture minimal (since the timeout period is set 
to 10 minutes in order to prevent those same profiling 
attacks). 


Incorporating future path constraints. There have 
been several proposals that incorporate more constraints 
in the Tor path selection protocol. For example, it 
has been suggested that relays must be chosen to min- 
imize the chance of an end-to-end timing analysis at- 
tack [11,27]. Also, Sherr et al. [40] proposed to enable 
applications to choose relays based on different perfor- 
mance constraints like node-based selection, link-based 
selection, and end-to-end path-based selection. We note 
that PIR-Tor is able to incorporate these ideas to the ex- 
tent that each block fetched from the database contains 
multiple descriptors, and clients could apply similar al- 
gorithms to select the descriptor that best fits their con- 
straints. 


Preserving option to download global view. We note 
that many use cases may require a global view of the 
system. For example, it may be helpful to researchers or 
developers working on improving the security and per- 
formance of the Tor network to have a global view of 
the system. Thus we propose that directory servers also 
support an option to download the full database. 


Limitations. The Tor network is composed of volunteer
nodes that contribute their bandwidth for anonymous
communication. Our proposal essentially trades
off bandwidth for computation at the directory servers, 
and thus directory servers are required to volunteer some 
extra computational resources. We show in our perfor- 
mance evaluation that only a small fraction of CPU re- 
sources need to be volunteered by the designated direc- 
tory servers, especially in the case of ITPIR-Tor. We be- 
lieve that PIR-Tor offers a good tradeoff between band- 
width and computational resources, and results in an 
overall reduction in resource consumption at volunteer 
nodes. Secondly, our design is not as scalable as alternative
peer-to-peer approaches, which can scale to tens of
millions of relays. However, our design provides improved
security properties over prior work. In particular, reason- 
able parameters of PIR-Tor provide equivalent security 
to that of the Tor network. The security of our archi- 
tecture mostly depends on the security of PIR schemes 
which are well understood and relatively easy to analyze, 
as opposed to peer-to-peer designs that require analyzing 
extremely complex and dynamic systems. The only ex- 
ception to this is the scenario of CPIR-Tor with descrip- 
tor re-use, where the security analysis is more complex. 
Moreover, for all scaling scenarios, the communication 
overhead in our architecture is at least an order of mag- 
nitude smaller than that of Tor. Finally, PIR-Tor assumes 
the use of a small set of standard exit policies for nodes 
to select from, though a few outliers can be tolerated by 
downloading their information in their entirety. 


10 Conclusion 


In this paper, we presented PIR-Tor, an architecture for 
the Tor network where clients do not need to maintain a 
global view of the system, and instead leverage private 
information retrieval techniques to protect their privacy 
from compromised directory servers. In our evaluation, 
we find that PIR-Tor reduces the communication over- 
head of the Tor network by at least an order of magni- 
tude. We analyzed two flavors of our architecture, based 
on computational PIR and information-theoretic PIR re- 
spectively. In computational PIR, clients fetch only a few 
blocks from the PIR database, and reuse blocks to build 
additional circuits. While this modification has no im- 
pact on client anonymity, it slightly weakens the unlink- 
ability of circuits. On the other hand, in information- 
theoretic PIR, clients perform a PIR query per circuit 
creation and do not reuse blocks, resulting in a level of 
security that is equivalent to the Tor network. While 
information-theoretic PIR requires all guard relays to be 
directory servers, computational PIR is more easily inte- 
grated into the current Tor network. 


The Phantom Tollbooth: Privacy-Preserving Electronic
Toll Collection in the Presence of Driver Collusion 


Sarah Meiklejohn*, UC San Diego
Keaton Mowery†, UC San Diego


Abstract


In recent years, privacy-preserving toll collection has been 
proposed as a way to resolve the tension between the de- 
sire for sophisticated road pricing schemes and drivers’ 
interest in maintaining the privacy of their driving pat- 
terns. Two recent systems in particular, VPriv (USENIX 
Security 2009) and PrETP (USENIX Security 2010), use 
modern cryptographic primitives to solve this problem. In 
order to keep drivers honest in paying for their usage of 
the roads, both systems rely on unpredictable spot checks 
(e.g., by hidden roadside cameras or roaming police vehi- 
cles) to catch potentially cheating drivers. 

In this paper we identify large-scale driver collusion 
as a threat to the necessary unpredictability of these spot 
checks. Most directly, the VPriv and PrETP audit pro- 
tocols both reveal to drivers the locations of spot-check 
cameras — information that colluding drivers can then 
use to avoid paying road fees. We describe Milo, a new 
privacy-preserving toll collection system based on PrETP, 
whose audit protocol does not have this information leak, 
even when drivers misbehave and collude. We then evalu- 
ate the additional cost of Milo and find that, when com- 
pared to naive methods to protect against cheating drivers, 
Milo offers a significantly more cost-effective approach. 


1 Introduction 


Assessing taxes to drivers in proportion to their use of 
the public roads is a simple matter of fairness, as road 
maintenance costs money that drivers should expect to 
pay some part of. Gasoline taxes, currently a proxy for 
road use, are ineffective for implementing congestion 
pricing for city-center or rush-hour traffic. At the same 
time, the detailed driving records that would allow for 
such congestion pricing also reveal private information 
about drivers’ lives, information that drivers do seem to 
have an interest in keeping private. (In the U.S., for example,
some courts have recognized drivers’ privacy interests by 
forbidding the police from using a GPS device to record 
a driver’s movements without a search warrant [1].) 

The VPriv [39] and PrETP [4] systems for private 
tolling, proposed at USENIX Security 2009 and 2010 
respectively, attempt to use modern cryptographic pro- 
tocols to resolve the tension between sophisticated road 
pricing and driver privacy. At the core of both these sys- 
tems is a monthly payment and audit protocol. In her 
payment, each driver commits to the road segments she 
traversed over the month and the cost associated with each 
segment, and reveals the total amount she owes. The prop- 
erties of the cryptography used guarantee that the total is 
correct assuming the segments driven and their costs were 
honestly reported, but that the specific segments driven 
are still kept private. 
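
One standard way to obtain this kind of guarantee is with additively homomorphic commitments, where the product of per-segment commitments is itself a commitment to the sum of the fees. The toy Pedersen-style sketch below is an assumption made for illustration, not the construction specified in this section, and its parameters (`P`, `G`, `H`) are far too small to be secure.

```python
# Toy parameters: insecure, chosen only so that the example runs instantly.
P = 2039      # small prime modulus
G, H = 4, 9   # two fixed bases; in a real scheme nobody may know log_G(H)


def commit(value: int, opening: int) -> int:
    """Commitment to `value` under the randomness `opening`."""
    return (pow(G, value, P) * pow(H, opening, P)) % P


# A driver commits to the fee of each road segment she drove (fees in cents).
fees = [120, 75, 300]
openings = [17, 501, 88]  # would be chosen at random in practice
commitments = [commit(f, r) for f, r in zip(fees, openings)]

# She reveals only the total fee and the sum of the openings.
total_fee = sum(fees)
total_opening = sum(openings)

# Anyone can multiply the commitments and compare against the revealed total.
product = 1
for c in commitments:
    product = (product * c) % P

assert product == commit(total_fee, total_opening)
print("total of", total_fee, "cents verified without opening any single segment")
```

The check confirms the arithmetic of the total but says nothing about whether the committed segments are the ones actually driven, which is why the auditing step is needed.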

To ensure honest reporting, the systems use an audit- 
ing protocol: throughout the month, roadside cameras 
occasionally record drivers’ locations; at month’s end, 
the drivers are challenged to show that their committed 
road segments include the segments in which they were 
observed, and that the corresponding prices are correct. 
So long as such spot checks occur unpredictably, drivers 
who attempt to cheat will be caught with high probability 
given even a small number of auditing cameras. In the 
audit protocols for both VPriv and PrETP, however, the 
authority reveals to each driver the locations at which 
she was observed. (The driver uses this information to 
open the appropriate cryptographic commitments.) If the 
cameras aren’t mobile, or are mobile but can be placed 
in only a small set of suitable locations (e.g., overpasses 
or exit signs along a fairly isolated highway), then the 
drivers will easily learn where the cameras are (and, per- 
haps more importantly, where they aren’t). Furthermore, 
if drivers collude and share the locations at which they 
were challenged, then a few audit periods will suffice 
for colluding drivers to learn and map the cameras’ loca- 
tions. 
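
The claim that a few unpredictable cameras suffice can be made concrete with a simple model. Assume, purely for illustration, that each unpaid segment is observed independently with probability `p_observed`; neither this independence assumption nor the example numbers come from VPriv or PrETP. A driver who skips payment on `unpaid_segments` segments is then caught at least once with probability 1 - (1 - p)^n.

```python
def detection_probability(p_observed: float, unpaid_segments: int) -> float:
    """Chance of at least one spot check landing on an unpaid segment.

    Assumes each unpaid segment is observed independently with probability
    p_observed (a modelling assumption, not a property of any real deployment).
    """
    return 1.0 - (1.0 - p_observed) ** unpaid_segments


# Cameras cover 1% of segments; the driver cheats on 300 segments in a month.
print(detection_probability(0.01, 300))

# If the driver knows where the cameras are and cheats only on camera-free
# segments, the effective observation probability is zero.
print(detection_probability(0.0, 300))
```

The second call shows why leaked camera locations matter: once cheating is confined to unobserved segments, the number of cameras no longer affects the outcome.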

We believe the model of large-scale driver collusion is
a realistic one. For example, drivers already collaborate 
to share the locations of speed cameras [37] and red-light 
cameras [38]; if we extend this behavior to consider maps 
of audit cameras, then we see that the unpredictable spot 
checks required in the analysis of VPriv and PrETP are 
difficult to achieve in the real world when drivers may col- 
lude on a large scale. When drivers know where cameras 
are (and where they aren’t), they will not pay for segments 
that are camera-free, and may even change driving pat- 
terns to avoid the cameras. By collaborating, drivers can 
discover and share camera locations at acceptable cost; 
in fact, if the cameras are revealed to them directly in the 
course of the audit protocol then they can do so without 
incurring a single fine. 
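
How quickly pooled audit challenges expose the cameras can be illustrated with a small simulation. The model is ours: each driver is challenged on a few camera locations drawn uniformly at random, which ignores real traffic patterns, and all parameter values are made up.

```python
import random
from typing import Tuple


def pooled_camera_knowledge(
    num_cameras: int,
    num_drivers: int,
    sightings_per_driver: int,
    seed: int = 0,
) -> Tuple[int, int]:
    """Return (cameras known to the coalition, most cameras known to one driver).

    Assumes each driver's audit reveals `sightings_per_driver` distinct camera
    locations chosen uniformly at random (a simplification).
    """
    rng = random.Random(seed)
    cameras = list(range(num_cameras))
    known = set()
    best_single = 0
    for _ in range(num_drivers):
        revealed = set(rng.sample(cameras, sightings_per_driver))
        best_single = max(best_single, len(revealed))
        known |= revealed  # colluding drivers share what their audits revealed
    return len(known), best_single


pooled, single = pooled_camera_knowledge(num_cameras=50, num_drivers=200, sightings_per_driver=3)
print(f"one driver learns {single} cameras; the coalition learns {pooled} of 50")
```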

Finally, one might argue that an appropriate placement 
of audit cameras at chokepoints will make them impos- 
sible to avoid, even if their location is known; the price 
charged for traversing such a chokepoint could then be 
made sufficiently high that it subsidizes the cost of main- 
taining other, unaudited road segments. This alternative 
arrangement may seem superficially appealing, but it is 
ultimately incompatible with driver privacy. If drivers 
cannot avoid a chokepoint they cannot but be observed 
by authorities when they cross it; in other words, this 
approach would be feasibly enforceable only when most 
drivers are regularly observed at the chokepoints. In fact, 
what we have described is precisely the situation today 
in many cities, where tolls are collected on bridges and 
other unavoidable chokepoints. 


Our contribution We show, in Section 4, how to mod- 
ify the PrETP system to obtain our own system, Milo, 
in which the authority can perform an agreed-upon num- 
ber of spot checks of a driver’s road-segment commit- 
ments without revealing the locations being checked. To 
achieve this, we adapt a recent oblivious transfer proto- 
col due to Green and Hohenberger [28] that is based on 
blind identity-based encryption. We have implemented 
and benchmarked our modifications to the audit protocol, 
showing (in Section 5) that they require a small amount 
of additional work for each driver and a larger but still 
manageable amount of work for the auditing authority. 
Basic fairness demands that drivers whom the authority 
accuses of cheating be presented with the evidence against 
them: a photo of their car at a time and location for 
which they did not pay. This means that drivers who 
intentionally incur fines will inevitably learn some camera 
locations; in some cases, a large coalition of drivers may 
therefore profitably engage in such misbehavior. Here the 
information about camera locations is leaked not by the 
audit protocol but by the legal proceedings that follow it. 
Finally, if the cameras are themselves visible then 
drivers will discover and share their locations, regardless 
of the cryptographic guarantees of the audit protocol.¹
All that is necessary is for one driver to spot the camera 
at any point during the month; the colluding drivers can 
then ensure that their commitments take this camera into 
account. We discuss this further in Section 6. 

In summary, our paper makes three concrete contribu- 
tions: 


• we identify large-scale driver collusion as a realistic
threat to privacy-preserving tolling systems;

• we modify the PrETP system to avoid leaking camera
locations to drivers during challenges; and

• we identify and evaluate other ways to protect
against driver collusion and compare their costs to
that of Milo.


2 System Outline 


In this section we present an overview of the Milo system. 
We discuss both the organizational structure of the system, 
as well as the security goals it is able to achieve. As 
our system is built directly on top of PrETP we have 
approximately maintained its structure, with the important 
differences highlighted below. 


2.1 Organization 


Milo consists of three main players: the driver, repre- 
sented by an On-Board Unit (OBU); the company operat- 
ing the OBU (abbreviated TSP, for Toll Service Provider); 
and finally the local government (or TC, for Toll Charger) 
responsible for setting the road prices and collecting the 
final tolls from the TSP, as well as for ensuring fairness 
on the part of the driver. The interactions between these 
parties can be seen in Figure 1. 

In some respects, the organization of Milo is similar 
to that of current toll collection systems. The driver will 
keep a certain amount of money in an account with the 
TSP; at the end of every month the driver will then pay 
some price appropriate for how much she drove and the 
amount of money remaining in the account will need to 
be replenished. The major difference, of course, is that 
the payments of the driver do not reveal any information 
about their actual locations while driving.² In addition, we
will require that the TC perform occasional spot checks 
to guarantee that drivers are behaving honestly. 

The OBU is a box installed in the car of the driver,
which is responsible for collecting location information,
computing the prices associated with the roads, and forming
the final payment information that is sent to the TSP
at the end of each month. Its work in this stage is described
formally in our Pay algorithm, which we present
in Section 4.


Figure 1: An overview of how the Milo system works.
As we can see, the OBU deals with the TSP for payment
purposes (using the Pay protocol), but for spot checks it
interacts with the TC (using the Audit protocol). The TC
conducts these audits using both the information recorded
by the cameras it operates along the roads and the OBU’s
payment information, which is forwarded on from the
TSP after it has been checked to be correct (using the
VerifyPayment protocol).


¹De Jonge and Jacobs [19] appear to have been the first to note that
unobservable cameras are crucial for random spot checks.

²As also noted by Balasch et al. [4], the pricing structure itself
may of course reveal driver locations; e.g., if segment i costs $2^i$ (see
Section 4), then all drivers’ paths are revealed by cost alone. This will
likely not be a problem in practice.

The TSP is responsible for the collection of tolls from 
the driver. At the end of each month, the TSP will receive 
a payment message from the OBU as specified above. It 
is then the job of the TSP to verify that this payment in- 
formation is correct, using the VerifyPayment algorithm 
outlined in Section 4. If the payment information is found 
to be correctly formed then the TSP can debit the appro- 
priate payment from the user’s account; otherwise, they 
can proceed in a legal manner that is similar to the way in 
which traffic violations are handled now. 

The TC, as mentioned, is the local government respon- 
sible for setting the prices on the roads, as well as the 
fines for dishonest drivers who are caught. The TC is 
also responsible for performing spot checks to ensure that 
drivers are behaving honestly. Although this presents a 
new computational burden for the TC (as compared to 
PrETP, for example, which has the TSP performing the 
spot checks), we believe that it is important to keep all lo- 
cation information completely hidden from the TSP, as it 
is a business with incentive to sell this information. Since 
the TC already sees where each car is driving regardless 
of which body performs the spot checks (since it is the 
one operating the cameras), having it perform the audits 
itself minimizes the privacy lost by the driver. 

Note, however, that the formal guarantees of correct- 
ness, security, and privacy provided by our system do not 
depend on having the TSP and TC not collaborate. In fact, 
both roles could be performed by a single organization.
Since in practice businesses such as E-ZPass play the role 
of TSP, we recommend the separation of duties above to 
avoid giving the TSP an incentive to monetize customers’ 
driving records. Of course, this assumes that regulation 
or the courts will forbid the government from misusing 
the information it collects. 


2.2 Security model 


In any privacy-preserving system, there are two goals 
which are absolutely essential to the success of the sys- 
tem: maintaining privacy, while still keeping users of the 
system honest. We discuss what this means in the context 
of electronic toll collection in the following two points: 


• Driver privacy: Drivers should be able to keep their
locations completely hidden from any other drivers 
who may want to intercept (and possibly modify) 
their payment information on its way to the TSP. 
With the exception of the random spot checks per- 
formed by the audit authority (in our case the TC), 
the locations of the driver should also be kept pri- 
vate from both the TC and the TSP. This property 
should hold even for a malicious TSP; as for the TC, 
we would like to guarantee that, as a result of the 
audit protocol, it learns only whether the driver was 
present at certain locations and times of its choice, 
even if it is malicious. The number of these locations 
and times about which the TC can query is fixed and 
a parameter of the audit protocol. An honest-but- 
curious TC will query the driver at those locations 
and times where she was actually observed, but a 
malicious TC might query for locations where no 
camera was present; see Section 4.3 for further dis- 
cussion. 


• Driver honesty: Drivers should not be able to tamper
with the OBU to produce incorrect location or
price information; i.e., pretending they were in a 
given location, using lower prices than are actually 
assigned, or simply turning off the OBU to pretend 
they drove less than they actually did. This property 
should hold even if drivers are colluding with other 
dishonest drivers, and should in fact hold even if 
every driver in the system is dishonest. 


These security goals should look fairly similar to those 
outlined in previous work (e.g., PrETP or VPriv [39], 
and inspired by the earlier work of Blumberg, Keeler, 
and shelat [8]), but we note the consideration of possibly 
colluding drivers as an essential addition. We also note 
that we do not consider physical attacks (i.e., a malicious 
party gaining physical access to a driver’s car) in this 
model, as we believe these attacks to be out of scope. 

For ideal privacy, the locations of each driver would
be kept entirely private even from the TC. This does not 
seem to be possible, however, as it would allow drivers to 
behave dishonestly without any risk of getting caught. Be- 
cause each camera does take away some degree of privacy 
from the driver, we would like to minimize the number of 
cameras operated by the TC; at the same time, we need 
to keep it high enough so that the TC will have a very 
good chance of catching any cheating drivers. We believe 
this to be a fundamental limitation on the value of any 
privacy-preserving tolling system, however, as such systems are
privacy preserving only when the spot-check cameras do 
not monitor such a large fraction of trips that the records 
themselves constitute a substantial privacy violation. As 
Blumberg, Keeler, and shelat write, “Extensive camera 
networks are simply not compatible with the kinds of pri- 
vacy we demand since they collect too much information. 
If misused, they can provide adequate data for real-time 
tracking of vehicles” [8]. 

Finally, we note that these security properties are both 
achieved by Milo, under the assumption that cameras are 
randomly placed and invisible to drivers (i.e., the only 
way camera locations can leak to drivers is during the 
audit protocol). We discuss the potential issues with this 
assumption in Section 6. 


3 Cryptographic Background 


Because our scheme follows closely the PrETP construc- 
tion [4], we employ the same modern cryptographic 
primitives as they do: commitment schemes and zero- 
knowledge proofs, in addition to the more familiar primi- 
tive of digital signatures [26]. In addition, to keep the spot- 
check camera locations hidden from drivers, we make use 
of another primitive, blind identity-based encryption, in a
manner that is inspired by the oblivious transfer protocol 
of Green and Hohenberger [28]. 


3.1 Commitments 


A commitment scheme is essentially the cryptographic 
relative of an envelope, and consists of two main phases: 
forming the commitment and opening the commitment. 
First, to form a commitment to a certain value, a user 
Alice can put the value in the envelope and then seal 
the envelope; to keep the analogy going, let’s also as- 
sume she sealed it in some special way such that only 
she can open it. The sealed envelope then acts as her 
commitment, which she can send on to another user Bob. 
When the time comes, Alice can reveal the committed 
value by opening the envelope and showing Bob its con- 
tents. There are two properties that commitment schemes 
satisfy: hiding and binding. The hiding property says 
that, because Alice is the only one who can unseal the 
envelope, Bob will not be able to learn any information 
about its contents before she reveals them. In addition, 
the binding property says that, because the envelope is
sealed, Alice will not be able to open it, change the value 
inside, and give it back to Bob without him noticing. In 
other words, when Alice finally reveals the opening of 
the commitment, Bob can be satisfied that those were 
the values inside all along. We will use the notation 
c = Com(m;r) to mean that c is a commitment to the 
message m using some randomness r. (Note that there 
are also some parameters involved, but that here and 
in the primitives that follow we omit them for simplic- 
ity.) 

One more property we will require of our commitment 
schemes is that they are additively homomorphic. This 
means that there is an operation on commitments, call 
it $\boxplus$, such that if $c_1$ is a commitment to $m_1$ and $c_2$ is a
commitment to $m_2$, then $c_1 \boxplus c_2$ will be a commitment to
$m_1 + m_2$. This property can be achieved by a variety of
schemes; to best suit our purposes, we work with Fujisaki- 
Okamoto commitments [18, 22], which rely on the Strong 
RSA assumption for their security. 
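
The additive homomorphism can be illustrated with a minimal sketch. It assumes a Pedersen-style commitment over a small prime field as a stand-in; it is not the Fujisaki-Okamoto scheme used in the paper, and the modulus and bases below are toy values chosen for illustration only, far too small and too structured to be secure.

```python
# Toy parameters (assumptions for illustration, NOT secure).
P = 2**127 - 1   # a Mersenne prime used as the modulus
G = 3            # first base
H = 7            # second base; a real scheme needs log_G(H) to be unknown


def commit(m: int, r: int) -> int:
    """Com(m; r) = G^m * H^r mod P, with r the commitment randomness."""
    return (pow(G, m, P) * pow(H, r, P)) % P


def combine(c1: int, c2: int) -> int:
    """The homomorphic operation on commitments: multiplication mod P."""
    return (c1 * c2) % P


def opens_to(c: int, m: int, r: int) -> bool:
    """Check an opening (m; r) against a commitment c."""
    return c == commit(m, r)


c1 = commit(5, 11)
c2 = commit(7, 13)
# Combining the commitments yields a commitment to the sum of the
# messages, under the sum of the randomness values.
print(opens_to(combine(c1, c2), 5 + 7, 11 + 13))
```

In this sketch the combined commitment opens to the sum of the messages because the exponents of each base simply add.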


3.2 Zero-knowledge proofs 


Our second primitive, zero-knowledge proofs [24, 25], 
provides a way for someone to prove to someone else 
that a certain statement is true without revealing anything 
beyond the validity of the statement. For example, a user 
of a protected system might want to prove the statement “I
have the password corresponding to this username” with- 
out revealing the password itself. The two main prop- 
erties of zero-knowledge proofs are soundness and zero 
knowledge. Soundness guarantees that the verifier will 
not accept a proof for a statement that is false; in the 
above example, this means that the system will accept the 
proof only if the prover really does have the password. 
Zero knowledge, on the other hand, protects the prover’s 
privacy and guarantees that the system in our example 
will not learn any information about the password itself, 
but only that the statement is true. A non-interactive 
zero-knowledge proof (NIZK for short) is a particularly 
desirable type of proof because, as the name indicates, 
it does not require any interaction between the prover 
and the verifier. For a given statement S, we will use the 
notation $\pi = \mathrm{NIZKProve}(S)$ to mean a NIZK formed by
the prover for the statement S. Similarly, we will use
$\mathrm{NIZKVerify}(\pi, S)$ to mean the process run by the verifier
to check, using $\pi$, that S is in fact true. In our system
we will need to prove only one type of statement, often
called a range proof, which proves that a secret value x
satisfies the inequality $lo \le x \le hi$, where lo and hi are
both public. For this we can use Boudot range proofs
and their extensions [11, 34], which are secure in the
random oracle model [6] under the Strong RSA
assumption.


3.3 Blind identity-based encryption


Finally, to maintain driver honesty even in the case of 
possible collusions between drivers (as discussed in Sec- 
tion 2), we use an additional cryptographic primitive: 
identity-based encryption [10, 42]. Intuitively, identity- 
based encryption (IBE for short) extends the notion of 
standard public-key encryption by allowing a user’s pub- 
lic key to be, rather than just a random collection of bits, 
some meaningful information relevant to their identity; 
e.g., their e-mail address. This is achieved through the 
use of an authority, who possesses some master secret key 
msk and can use it to provide secret keys corresponding 
to given identities on request (provided, of course, that 
the request comes from the right person). When we work 
with IBE, we will use the syntax C = IBEnc(id;m) to 
mean an identity-based encryption of the message m, in- 
tended for the person specified in the identity id. We will 
similarly use $m = \mathrm{IBDec}(sk_{id}; C)$ to mean the decryption
of C using the secret key for the identity id. 

Because of how IBE is integrated into our system, we 
will need the IBE to be augmented by a blind extraction 
protocol: a protocol interaction between a user and the 
authority at the end of which the user obtains the secret 
key corresponding to some identity of her choice, but 
the authority does not learn which identity was requested 
(and also does not learn the secret key for that identity). 
This process of getting the secret key will be denoted as 
$sk_{id} = \mathrm{BlindExtract}(id)$, keeping in mind that the authority
learns neither id nor $sk_{id}$. As we show in Section 4,
this property (introduced by Green and Hohenberger [28]) 
is crucial for guaranteeing that drivers do not learn where 
the TC has its cameras. 

Furthermore, we would like our IBE to be anony- 
mous [2], meaning that given a ciphertext C, a user cannot 
tell which identity the ciphertext is meant for (so, in par- 
ticular, they cannot check to see if a guess is correct). 
Again, as we show in Section 4, this property is necessary 
to ensure that the TSP cannot simply guess and check 
where the driver was at a given time, and thus potentially 
learn information about her whereabouts. 

To the best of our knowledge, there are two blind and 
anonymous IBEs in the cryptographic literature: the first 
due to Camenisch, Kohlweiss, Rial, and Sheedy [13] 
and the second to Green [27]; both are blind variants on 
the Boyen-Waters anonymous IBE [12]. While either of 
these schemes would certainly work for our purposes, 
we chose to come up with our own scheme in order to 
maximize efficiency. Our starting point is the Boneh- 
Franklin IBE [10], which is already anonymous [2, Sec- 
tion 4.5]. We then introduce a blind key-extraction pro- 
tocol for Boneh-Franklin, based on the Boldyreva blind 
signature [9]. Finally, we “twin” the entire scheme to es- 
sentially run two copies in parallel; this is just to facilitate 
a “Twin Diffie-Hellman” style security proof [15]. We
give a full description of our scheme in the full version 
of our paper [36], as well as a proof of its security in a 
variant of the Green-Hohenberger security model. Our 
IBE is conveniently efficient, but we stress that the Milo 
system could be instantiated with any provably secure 
IBE that is both blind and anonymous (and in particular 
the schemes of Camenisch et al. and Green which, while 
not as efficient as our scheme, have the attractive proper- 
ties that they use significantly weaker assumptions and do 
not rely on random oracles in their proofs of security). 

In the broadest sense, our blind IBE can be viewed 
as a special case of a secure two-party computation be- 
tween the OBU and the TC, at the end of which the TC 
learns whether or not the driver paid honestly for a given 
segment, and the driver learns nothing (and in particular 
does not learn which segment the TC saw her in). As 
such, any efficient instantiation of this protocol as a se- 
cure two-party computation would be sufficient for our 
purposes. One promising approach, suggested by an anonymous
reviewer, uses an oblivious pseudorandom function
(OPRF for short) as a building block. With an OPRF, a
user with access to a seed k for a PRF f and another user
with input x can securely evaluate $f_k(x)$ without the first
user’s learning x or $f_k(x)$, and without the second user’s
learning the seed k; this can be directly applied to our set- 
ting by treating the seed k as a value known by the OBU, 
and the input x as the segment in which the TC saw the 
driver. An efficient OPRF was recently given by Jarecki 
and Liu [32]. Compared to our approach, the OPRFs of 
Jarecki and Liu may require increased interaction (which 
has implications for concurrent security) and potentially 
more computation than ours. 


4 Our Construction 


In this section, we describe the various protocols used 
within our system and how they meet the security goals 
described in Section 2.2; we note that only Algorithm 4.3 
substantially differs from what is used in PrETP. There 
are three main phases we consider: the initialization of 
the OBU, the forming and verifying of the payments per- 
formed by the OBU and the TSP respectively, and the 
audit between the TC and the OBU. Below, we will detail 
the functioning of each of these algorithms; first, though, 
we give some intuition for how our scheme works and 
why the use of blind identity-based encryption means the 
audit protocol does not leak the locations of spot-check 
cameras to drivers. 

In the audit protocol, the driver needs to show that her 
actual driving is consistent with the fee she chose to pay. 
To do this, she must upload her (claimed) driving history 
to the TSP’s server; if she didn’t, the TSP would have 
nothing to check the correctness of. Obviously, simply 
uploading this history in the clear would provide no privacy.
The VPriv system sidesteps this by having the driver
upload the segments anonymously (using an anonymizing 
service such as Tor [20]), accompanied by a “tag” that 
will allow her to claim them as her own. We instead fol- 
low PrETP in having the driver upload a commitment 
of sorts to each of her segments. In addition, the driver 
commits to the cost associated with each segment using 
the additively homomorphic commitment scheme. Check- 
ing that the total payment is the sum of the fees for each 
committed segment is now easy: using the homomorphic 
operation $\boxplus$, the TSP can compute a commitment to the
sum of the committed fees; the driver then provides the
opening of this sum commitment, showing that its value
is the fee she paid.³

What remains is to prove that the committed segments 
the driver uploaded to the server are in fact the segments 
she drove, and that the committed fee she uploaded along- 
side each is in fact the fee charged for driving it. Fol- 
lowing VPriv, PrETP, and de Jonge and Jacobs’ system 
(see Section 7), we rely on spot check cameras. The TC’s 
cameras observed the driver at a few locations over the 
course of the month. It now challenges the driver to show 
that these locations are among the committed segments, 
and that the corresponding committed fees are correct. If 
the driver cannot show a commitment that opens to one of 
these spot check locations, she has been caught cheating; 
if the spot check locations are unpredictable then a simple 
probability analysis (see Section 6.1) shows that a cheat- 
ing driver will likely be caught. In PrETP, the spot check 
has the TC sending to the driver the locations and times 
where she was observed; the driver returns the index and 
opening of the corresponding committed segments. This, 
of course, leaks the spot check locations to the driver. To 
get around this, we must somehow transmit the appro- 
priate openings to the TC without the driver finding out 
which commitments are being opened. 

Identity-based encryption allows us to achieve exactly 
the requirement above. Along with each of her commit- 
ments, the driver encrypts the opening of the commitment 
using IBE; the identity to which a commitment is en- 
crypted is the segment location and time. She sends these 
encrypted openings to the TC along with the commit- 
ments themselves. (Note that it is crucial the ciphertext 
not reveal the identity to which it was encrypted, since 
otherwise the TSP and TC would learn the driver’s entire 
driving history. This is why we require an anonymous 
IBE.) Now, if the TC had the secret key for the identity 
corresponding to the place and time where the driver was 
spotted, it could decrypt the appropriate ciphertext, obtain
the commitment opening value, and check that the
corresponding commitment was properly formed. But
the TC can’t ask the driver for the secret key, since this
would also leak the spot-check location. Instead, it engages
with the driver in a blind key-extraction protocol.
The TC provides as input the location and time of the spot
check and obtains the corresponding secret key without
the driver learning which identity (i.e., location and time)
was requested. By undertaking the blind extraction protocol
only a certain number of times, the driver limits the
number of spot checks the TC can perform.

³ There is a technicality here: range proofs are needed to prevent the
driver from artificially reducing the amount she owes by committing to
a few negative values. See Section 4.2 for more on this.

Note that this is essentially an oblivious transfer proto- 
col; our solution is in fact closely related to the oblivious 
transfer protocol of Green and Hohenberger [28], who 
introduced blind IBE. 

Before any of the three phases can take place, the TC 
first decides on the segments used for payment and how 
much each one actually costs. It starts by dividing each 
road into segments of some appropriate length, for exam- 
ple one city block in denser urban areas or one mile along 
a highway in less congested areas. Because prices might 
change according to time of day, the TC also decides on a 
division of time into discrete quanta based on some “time 
step” when a new segment must be recorded by the OBU 
(even if the location endpoint has not yet been reached). 
For example, if two location endpoints are set as Exit 17 
and Exit 18 on a highway and the time step is set to be 
a minute, then a driver traveling between these exits for 
more than a minute will have segments with the same 
location endpoints, but different time endpoints. In par- 
ticular, if this driver starts at 22:00 and takes about three 
minutes to get from one exit to the other, she will end up 
with three segments:⁴


• ((exit 17, exit 18), (22:00, 22:01));
• ((exit 17, exit 18), (22:01, 22:02)); and
• ((exit 17, exit 18), (22:02, 22:03)).


Each segment is of the form $((loc_1, loc_2), (time_1, time_2))$;
in the future, we denote these segments as (where, when),
where where represents the physical limits of the segment
and when represents the particular time quantum
during which the driver was in the segment where. For
each of these segments, the TC will have assigned some
price; this can be thought of as a publicly available function
$f : (where, when) \to [0, M]$, where M is the maximum
price assigned by the TC.
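
The segment structure and price function described above can be sketched as follows. This is a minimal illustration, not the system's actual encoding: the function names, the date, the price values, and the peak-hour rule are all hypothetical, and the sketch assumes the trip starts exactly on a time-step boundary.

```python
from datetime import datetime, timedelta

M = 100  # hypothetical maximum price (e.g., in cents)


def split_into_segments(where, start, end, step):
    """Return one (where, (t0, t1)) segment per time quantum in [start, end)."""
    segments = []
    t0 = start
    while t0 < end:
        t1 = t0 + step
        segments.append((where, (t0, t1)))
        t0 = t1
    return segments


def price(where, when):
    """Hypothetical f : (where, when) -> [0, M]; peak hours cost more."""
    t0, _ = when
    is_peak = 7 <= t0.hour < 10 or 16 <= t0.hour < 19
    return 50 if is_peak else 20


# The example from the text: three minutes between Exit 17 and Exit 18,
# starting at 22:00, with a one-minute time step (the date is arbitrary).
trip = split_into_segments(
    ("exit 17", "exit 18"),
    datetime(2011, 8, 10, 22, 0),
    datetime(2011, 8, 10, 22, 3),
    timedelta(minutes=1),
)
for where, when in trip:
    print(where, when[0].time(), when[1].time(), price(where, when))
```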


⁴ In practice, the segment information will of course be more detailed;
as a byproduct of using GPS anyway, each car will have access to precise
coordinate and time information (including date).


4.1 Initialization


Before any payments can be made, there are a number of
parameters that need to be loaded onto the OBU. To start,
the OBU will be given some unique value to identify itself
to the TSP; we refer to this value as tag. Because the OBU 
will be signing all the messages it sends, it first needs to 
generate a signing keypair $(vk_{tag}, sk_{tag})$; the public verification
key will need to be stored with both the TSP and
TC, while the signing key will be kept private. We will 
also use an augmented version of the BlindExtract pro- 
tocol (mentioned in Section 3.3) in which the OBU and 
TC will sign their messages to each other, which means 
the OBU will need to have the verification key for the 
TC stored as well (alternatively, they could just commu- 
nicate using a secure channel such as TLS). In addition, 
the OBU will need to generate parameters for an IBE 
scheme in which it possesses the master secret key msk, 
as well as to load the parameters for the commitment and 
NIZK schemes (note that it is important the OBU does 
not generate these latter parameters itself, as otherwise 
the driver would be able to cheat). Finally, the OBU will 
also need to have stored the function f used to define road 
prices. 


4.2 Payments 


Once the OBU is set up with all the necessary parame- 
ters, it can begin making payments. As the driver travels, 
the GPS picks up location and time information, which 
can then be matched to segments (where,when). For 
each of these segments, the OBU first computes the cost 
for that segment as p = f(where, when). It then computes
a commitment c to this value p; we will refer to the
opening of this commitment as $open_c$. Next, the OBU
computes an identity-based encryption C of the opening
$open_c$ along with a confirmation value $0^\lambda$, using the
identity id = (where, when). Finally, the OBU computes
a non-interactive zero-knowledge proof $\pi$ that the value
contained in c is in the range [0, M]. This process is then
repeated for every segment driven, so that by the end
of the month the OBU will end up with a set of tuples
$\{(c_i, C_i, \pi_i)\}_{i=1}^{n}$. In addition to this set, the OBU will also
need to compute the opening $open_{final}$ for the commitment
$c_{final} = c_1 \boxplus c_2 \boxplus \cdots \boxplus c_n$; i.e., the opening for the
commitment to the sum of the prices, which effectively
reveals how much the driver owes. The OBU then creates
the final message $m = (tag, open_{final}, \{(c_i, C_i, \pi_i)\}_{i=1}^{n})$,
signs it to get a signature $\sigma_m$, and sends to the TSP the
tuple $(m, \sigma_m)$. This payment process is summarized in
Algorithm 4.1. The parameter $\lambda$, set to 160 for 80-bit
security, is explained below.

Algorithm 4.1: Pay, run by the OBU

Input: segments $\{(where_i, when_i)\}_{i=1}^{n}$, identifier tag, signing key $sk_{tag}$
1  forall $1 \le i \le n$ do
2      $p_i = f(where_i, when_i)$
3      $c_i = \mathrm{Com}(p_i; r_i)$
4      $C_i = \mathrm{IBEnc}((where_i, when_i); (p_i; r_i; 0^\lambda))$
5      $\pi_i = \mathrm{NIZKProve}(0 \le p_i \le M)$
6  $open_{final} = (\sum_i p_i; \sum_i r_i)$
7  $m = (tag, open_{final}, \{(c_i, C_i, \pi_i)\}_{i=1}^{n})$
8  $\sigma_m = \mathrm{Sign}(sk_{tag}, m)$
9  return $(m, \sigma_m)$

Once the TSP has received this tuple, it first looks up
the verification key for the signature using tag. If it is
satisfied that this message came from the right OBU, then
it performs several checks; if not, it aborts and alerts
the OBU that something went wrong (i.e., the message
was manipulated in transit) and it should resend the tuple.
Next, it checks that each commitment $c_i$ was properly
formed by acting as the verifier for the NIZK $\pi_i$; if one of
these checks fails then it knows that the driver committed
to an incorrect price (for example, a negative price to try
to drive down her monthly bill). The TSP then performs
the homomorphic operation on the commitments to get
$c_{final} = c_1 \boxplus c_2 \boxplus \cdots \boxplus c_n$ and checks that $open_{final}$ is the
opening for $c_{final}$. If all these checks pass, the TSP can
debit $p_{final}$ (contained in $open_{final}$) from the user’s account;
if not, something has gone wrong and the TSP can flag the 
driver as suspicious and continue on to legal proceedings, 
as is done with current traffic violations. This algorithm 
is summarized in Algorithm 4.2. 
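
The sum check at the heart of the payment phase can be sketched as follows. This is a simplified illustration under stated assumptions: it uses a toy Pedersen-style commitment (not Fujisaki-Okamoto) with insecure parameters, and it omits the signature, the identity-based encryptions, and the range proofs entirely. The names `pay` and `verify_payment` are hypothetical.

```python
import secrets
from functools import reduce

# Toy commitment parameters (assumptions for illustration, NOT secure).
P, G, H = 2**127 - 1, 3, 7


def commit(m, r):
    """Com(m; r) = G^m * H^r mod P."""
    return (pow(G, m, P) * pow(H, r, P)) % P


def pay(prices):
    """OBU side: commit to each segment price and open only the sum."""
    rs = [secrets.randbelow(P - 1) for _ in prices]
    commitments = [commit(p, r) for p, r in zip(prices, rs)]
    open_final = (sum(prices), sum(rs))  # (p_final; r_final)
    return commitments, open_final


def verify_payment(commitments, open_final):
    """TSP side: combine the commitments and compare with the opened sum."""
    c_final = reduce(lambda a, b: (a * b) % P, commitments, 1)
    p_final, r_final = open_final
    return c_final == commit(p_final, r_final)


commitments, open_final = pay([20, 50, 20])
print(open_final[0], verify_payment(commitments, open_final))
```

The verifier in this sketch learns only the total, not the individual prices. Without the range proofs, however, nothing here stops a driver from committing to negative values, which is why the full protocol includes them.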

In terms of privacy, the hiding property of the com- 
mitment scheme and the zero knowledge property of 
the NIZK scheme guarantee that the driver’s informa- 
tion is being kept private from the TSP. Furthermore, the 
anonymity of the IBE scheme guarantees that, although 
the segments are used as the identity for the ciphertexts $C_i$,
the TSP will be unable to learn this information given just 
these ciphertexts. In addition, some degree of honesty is 
guaranteed. First, because the message was signed by the 
OBU, the TSP can be sure that the tuple came from the 
correct driver and not some other malicious driver trying 
to pass herself off as someone else (or cause the first driver 
to pay more than she owes). Furthermore, if all the checks 
pass then the binding property of the commitment scheme 
and the soundness property of the NIZK scheme guarantee
that the values contained in the commitments are
valid prices and so the TSP can be somewhat convinced
that the price $p_{final}$ given by the driver is the correct price
she owes for the month. The TSP cannot, however, be 
convinced yet that the driver did not simply turn off her 
OBU or otherwise fake location or price information; for 
this, it will need to forward the payment tuple to the TC, 
which initiates the audit phase of the protocol. 


Algorithm 4.2: VerifyPayment, run by the TSP

Input: payment tuple $(m, \sigma_m)$, verification key $vk_{tag}$
1   if $\mathrm{SigVerify}(vk_{tag}, m, \sigma_m) = 0$ then
2       return $\bot$
3   parse m as $(tag, open_{final}, \{(c_i, C_i, \pi_i)\}_{i=1}^{n})$
4   forall $1 \le i \le n$ do
5       if $\mathrm{NIZKVerify}((c_i, \pi_i), 0 \le p_i \le M) = 0$ then
6           return suspicious
7   $c_{final} = c_1 \boxplus \cdots \boxplus c_n$
8   if $c_{final} = \mathrm{Com}(open_{final})$ then
9       parse $open_{final}$ as $(p_{final}; r_{final})$
10      debit account for tag by $p_{final}$
11      return okay
12  else
13      return suspicious


4.3 Auditing


As we argued in Section 2.2, although the audit protocol
does take away some degree of privacy from the driver,
this small privacy loss is necessary to ensure honesty
within the system. We additionally argued that the TC
should not reveal to the driver the locations of the cameras 
and furthermore believe that the driver should not even 
learn the number of cameras at which the TC saw her, as 
even this information would give her opportunity to cheat 
(for more on this see Section 6). We therefore assume 
that the TC makes some fixed number of queries k for 
every driver, regardless of whether or not it has in fact 
seen the driver k times. To satisfy this assumption, if the 
TC has seen the driver on more than k cameras, it will 
just pick the first k (or pick k at random, it doesn’t matter) 
and query on those. If it has seen the driver on fewer 
than k cameras, we can designate some segment to be a 
“dummy” segment, which essentially does not correspond 
to any real location/time tuple. The TC can then query on 
this dummy segment until it has made k queries in total; 
because the part of the protocol in which the TC performs 
its queries is blind, the OBU won’t know that it is being 
queried on the same segment multiple times. 
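
The padding rule for the number of queries can be sketched as follows. The dummy identity and the function name are hypothetical, and the sketch takes the "pick k at random" option mentioned above.

```python
import random

# Hypothetical identity that corresponds to no real location/time tuple.
DUMMY = ("dummy", "dummy")


def choose_queries(sightings, k, rng=random):
    """Return exactly k segments to query, however many sightings exist."""
    if len(sightings) >= k:
        # Too many sightings: keep a random subset of size k.
        return rng.sample(list(sightings), k)
    # Too few sightings: pad with the dummy segment up to k queries.
    return list(sightings) + [DUMMY] * (k - len(sightings))


print(choose_queries([("exit 17", "22:00")], 3))
```

Because the extraction step is blind, the OBU would see the same number of protocol runs in every case. Presumably the TC would simply disregard the outcome of its dummy queries, since no honest driver logs that segment.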

After the TSP has forwarded the OBU’s payment tu- 
ple to the TC, the TC first checks that the message re- 
ally came from the OBU (and not, for example, from 
a malicious user or even the TSP trying to frame the 
driver). As with the TSP, if this check fails then it can 
abort the protocol and alert the OBU or TSP. It then extracts
the tuples $\{(c_i, C_i, \pi_i)\}$ from m and begins issuing
its random spot checks to ensure that the driver was not
lying about her whereabouts. This process is outlined
in Algorithm 4.3. Because there were a certain number
of cameras the driver passed, the TC will have a set of
tuples $\{(loc_i, time_i)\}$ of its own that correspond to the
places and times at which the TC saw the driver. First,
for every pair (loc, time), the TC will need to determine
which segment this pair belongs to; this then gives it
a set $\{(where_i, when_i)\}$ of tuples that the driver would
have logged if she were behaving honestly (unless the
set has been augmented by the dummy segment as described
above, in which case the OBU clearly will not
have logged this segment).


After the TC has this set of tuples, it uses the identity-based
encryption $C_i$ contained within every tuple sent
by the OBU. Recall from Algorithm 4.1 that the identity
corresponding to each encryption is the segment
$(where_i, when_i)$, and that the encryption itself is of the
opening of the commitment $c_i$ (contained in the same
tuple), along with a confirmation value $0^\lambda$. Therefore, if
the TC can obtain the secret key $sk_{id}$ from the OBU for
the identity id = $(where_i, when_i)$, then it can successfully
decrypt the ciphertext and obtain the opening for the commitment,
which it can then use to check if the driver is
recording correct price information. Because the TC does
not know which ciphertext corresponds to which segment,
however, once the TC obtains this secret key it will then
need to attempt to decrypt each $C_j$.


To prevent drivers from using a single commitment
to pay for two segments, we require that it be computationally
difficult to find a ciphertext C that has valid
decryptions under two identities $id_1$ and $id_2$. For our IBE,
it is sufficient to encrypt a confirmation value $0^\lambda$ along
with the message (where $\lambda = 160$ for 80-bit security),
since messages are blinded with a random oracle hash
that takes the identity as input. On decryption, one checks
that the correct confirmation value is present. Note that
we do not require CCA security.


If $C_j$ does decrypt properly for some j, then the TC
checks that the value contained inside is the opening of
the commitment $c_j$. If it is, then the TC further checks that
the price $p_j$ is the correct price for that road segment by
computing $f(where_i, when_i)$. If this holds as well, then
the TC can be satisfied that the driver paid correctly for
the segment of the road on which she was seen and move
on to the next camera. If it does not hold, then the TC
has reason to believe that the driver lied about the price
of the road she was driving on. If instead the opening
is not valid, the TC has reason to believe that the driver
formed either the ciphertext $C_j$ or the commitment $c_j$
incorrectly. Finally, if none of the ciphertexts properly
decrypted using $sk_{id}$ (i.e., $C_j$ did not decrypt for any value
of j), then the TC knows that the driver simply omitted the
segment $(where_i, when_i)$ from her payment in an attempt
to pretend she drove less. In any of these cases, the 
TC believes the driver was cheating in some way and 
can undertake legal proceedings. If all of these checks 
pass for every camera, then the driver has successfully 
passed the audit and the TC is free to move on to another 
user. 
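
The audit logic, including the role of the confirmation value, can be sketched as follows. This is only an illustration under loud assumptions: the "IBE" here is a symmetric stand-in built from HMAC and SHAKE-256 (it is not a real identity-based scheme, and key extraction is NOT blind), the commitment is a toy Pedersen-style one with insecure parameters, and signatures and range proofs are omitted. All function names are hypothetical.

```python
import hashlib
import hmac
import secrets

P, G, H = 2**127 - 1, 3, 7   # toy commitment parameters (NOT secure)
LAMBDA_BYTES = 20            # 160-bit confirmation value 0^lambda


def commit(p, r):
    return (pow(G, p, P) * pow(H, r, P)) % P


def extract(msk, identity):
    """Stand-in for key extraction; in Milo this is a blind protocol."""
    return hmac.new(msk, repr(identity).encode(), hashlib.sha256).digest()


def _xor_stream(sk, nonce, data):
    stream = hashlib.shake_256(sk + nonce).digest(len(data))
    return bytes(a ^ b for a, b in zip(data, stream))


def encrypt(msk, identity, p, r):
    """Stand-in for IBEnc(id; (p; r; 0^lambda))."""
    plain = p.to_bytes(4, "big") + r.to_bytes(16, "big") + bytes(LAMBDA_BYTES)
    nonce = secrets.token_bytes(16)
    return nonce + _xor_stream(extract(msk, identity), nonce, plain)


def decrypt(sk, ciphertext):
    """Return (p, r) if the confirmation value is present, else None."""
    nonce, body = ciphertext[:16], ciphertext[16:]
    plain = _xor_stream(sk, nonce, body)
    if plain[20:] != bytes(LAMBDA_BYTES):
        return None
    return int.from_bytes(plain[:4], "big"), int.from_bytes(plain[4:20], "big")


def make_tuple(msk, segment, p):
    """OBU side: one (c_j, C_j) pair for a driven segment."""
    r = secrets.randbelow(P - 1)
    return commit(p, r), encrypt(msk, segment, p, r)


def audit(tuples, spotted, price, extract_key):
    """TC side: check every spotted segment against the uploaded tuples."""
    for segment in spotted:
        sk = extract_key(segment)
        for c_j, C_j in tuples:
            opened = decrypt(sk, C_j)
            if opened is None:
                continue              # wrong identity: try the next ciphertext
            p_j, r_j = opened
            if commit(p_j, r_j) != c_j or p_j != price(segment):
                return "suspicious"   # bad opening or wrong price
            break
        else:
            return "suspicious"       # nothing decrypted: segment was omitted
    return "okay"
```

A usage example under the same assumptions: with `msk = secrets.token_bytes(32)`, a flat `price = lambda seg: 20`, and `tuples = [make_tuple(msk, s, 20) for s in segments]`, the call `audit(tuples, [segments[0]], price, lambda s: extract(msk, s))` returns `"okay"`, while dropping the spotted segment from `tuples` yields `"suspicious"`.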


Algorithm 4.3: Audit, run by the TC

Input: payment tuple $(m, \sigma_m)$, camera tuples $\{(loc_i, time_i)\}_{i=1}^{k}$, verification key $vk_{tag}$
1   if $\mathrm{SigVerify}(vk_{tag}, m, \sigma_m) = 0$ then
2       return $\bot$
3   parse m as $(tag, open_{final}, \{(c_j, C_j, \pi_j)\}_{j=1}^{n})$
4   forall $1 \le i \le k$ do
5       determine segment $(where_i, when_i)$ for $(loc_i, time_i)$
6       $sk_i = \mathrm{BlindExtract}(where_i, when_i)$
7       match = 0
8       forall $1 \le j \le n$ do
9           $m_j = \mathrm{IBDec}(sk_i; C_j)$
10          if $m_j$ parses as $(p_j; r_j; 0^\lambda)$ then
11              match = 1
12              if $\mathrm{Com}(m_j) \ne c_j$ then
13                  return suspicious
14              if $p_j \ne f(where_i, when_i)$ then
15                  return suspicious
16              break
17      if match = 0 then
18          return suspicious
19  return okay

In terms of driver honesty, the addition of BlindExtract
allows the TC to obtain $sk_{id}$ without the OBU learning the
identity, and thus the location at which they were caught
on camera. As argued in Section 2, this is absolutely crucial
for maintaining driver honesty, both individually and
in the face of possible collusions. In terms of privacy, if
the OBU and TC sign their messages in the BlindExtract
phase, then we can guarantee that no malicious third party
can alter messages in their interaction in an attempt to
learn the segment in which the driver was caught on camera
(or, alternatively, frame the driver by corrupting $sk_{id}$).
As mentioned in Section 2, whereas the cameras do take 
away some part of the driver’s privacy, they are necessary 
to maintain honesty; we also note that no additional in- 
formation is revealed throughout the course of this audit 
interaction provided both parties behave honestly. One 
potential downside of this protocol, however, is that the 
TC is not restricted to querying locations at which it had 
cameras; it can essentially query any location it wants 
without the driver’s knowledge (although the driver is at 
least aware of how many queries are being made). We 
believe that our system could be augmented to resist such 
misbehavior through an “audit protocol audit protocol” 
that requires the TC to demonstrate that it actually has 
camera records corresponding to some small fraction of 
the spot checks it performs, much as its own audit protocol
requires the driver to reveal some small fraction of her
segments driven. This “audit audit” could be performed 
on behalf of drivers by an organization such as EFF or 
the ACLU; alternatively, in some legal settings an exclu- 
sionary rule could be introduced that invalidates evidence 
obtained through auditing authority misbehavior. 


                                    Time (ms)
Operation                       Laptop       ARM
Creating parameters              75.12   1083.61
Encryption                       82.11   1187.82
Blind extraction (user)          13.13    214.06
Blind extraction (authority)     11.21    175.25
Decryption                       78.31   1131.58

Table 1: The average time, in milliseconds and over a run
of 10, for the various operations in our blind IBE protocol,
performed on both a MacBook Pro and an ARM v5TE.
The numbers for encryption and decryption represent the
time taken to encrypt/decrypt a pair of 1024-bit numbers
using the curve $y^2 = x^3 + x \bmod p$ at the 80-bit security
level, and the numbers for blind extraction represent the
time to complete the computation required for each side
of the interactive protocol.


5 Implementation and Performance 


In order to achieve a more effective audit protocol, an 
extra computational burden is required for both the OBU 
and the TC. In this section, we consider just how great this 
additional burden is; in particular, we focus on our blind 
identity-based encryption protocol from the full version 
of our paper [36], as well as Algorithm 4.3 from Sec- 
tion 4.3. The benchmarks presented for these protocols 
were collected on two machines: a MacBook Pro running 
Mac OS X 10.6 with a 2.53 GHz Intel Core 2 Duo processor
and 4 GB of RAM, and an ARM v5TE running Linux
2.6.24 with a 520 MHz processor and 128 MB of RAM.
We believe that the former represents a fairly conserva- 
tive estimate for the amount of computational resources 
available to the TC, whereas the latter represents a ma- 
chine that could potentially be used as an OBU. For the 
bilinear groups needed for blind IBE we used the supersingular
curve $y^2 = x^3 + x \bmod p$ for a large prime p (which
has embedding degree 2) within version 5.4.3 of the MIRACL
library [41], and for the NIZKs and commitments we
used ZKPDL (Zero-Knowledge Proof Description Lan- 
guage) [35], which itself uses the GNU multi-precision 
library [23] for modular arithmetic. 

Table 1 shows the time taken for each of the unit operations
performed within the IBE scheme. As mentioned
in Section 4, in the context of our system the creation 
of the parameters will be performed when the OBU is 
initialized, the encryption will be performed during the 
Pay protocol (line 4 of Algorithm 4.1), and both blind 
extraction and decryption will be performed in the audit 
phase between the TC and the OBU (lines 6 and 9 of 
Algorithm 4.3 respectively). 

We consider the computational costs for the OBU and 
the TC separately, as well as the communication overhead 
for the whole system.⁵


⁵We do not consider the computational costs for the TSP here, as
OBU computational costs. During the course of a
month (or however long an audit period is), the OBU is 
required to spend time performing computations for two 
distinct phases of the Milo protocol. The first phase is the 
Pay protocol, which consists of computing the commit- 
ments to segment prices, encrypting the openings of the 
commitments, and producing a zero-knowledge proof that 
the value in the commitment lies in the right range. From 
Table 1, we know that encryption takes roughly a second
when encrypting a pair of 1024-bit numbers on the ARM. As
these correspond to “medium security” in PrETP [4, Ta- 
ble 2], and our commitments and zero-knowledge proofs 
are essentially identical to theirs, we can use the relevant 
timings from PrETP to see that the total time taken for 
the Pay protocol should be at most 20 seconds per seg- 
ment. As long as the time steps are at least 20 seconds 
and the segment lengths are at least half a mile (assuming 
people drive at most 90 miles per hour), the calculations 
can therefore be done in real time. 
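
The real-time condition above is simple arithmetic. As a rough check, the following sketch computes how long the fastest permitted driver spends in one segment and compares it with the 20-second bound on the Pay protocol. The function and constant names are ours and purely illustrative; the constants are the ones quoted in the text.

```python
# Illustrative check of the real-time bound for the Pay protocol.
# All names are hypothetical; the constants come from the text above.

PAY_SECONDS_PER_SEGMENT = 20   # upper bound on Pay time per segment (ARM)
MAX_SPEED_MPH = 90             # assumed maximum driving speed


def min_traversal_seconds(segment_miles, max_speed_mph=MAX_SPEED_MPH):
    """Shortest time, in seconds, that a driver can spend in one segment."""
    return segment_miles * 3600 / max_speed_mph


def pay_keeps_up(segment_miles, time_step_seconds):
    """True if one Pay computation finishes before the next segment begins."""
    return (time_step_seconds >= PAY_SECONDS_PER_SEGMENT
            and min_traversal_seconds(segment_miles) >= PAY_SECONDS_PER_SEGMENT)


print(min_traversal_seconds(0.5))   # half-mile segment at 90 mph
print(pay_keeps_up(0.5, 20))        # the boundary case from the text
print(pay_keeps_up(0.25, 60))       # quarter-mile segments are too short
```

With half-mile segments the fastest driver needs exactly as long to cross a segment as the OBU needs to process it, so shorter segments would fall behind under these assumptions.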

The second phase of computation is the end-of-month
audit protocol. Here, the OBU is responsible for
acting as the IBE authority to answer blind extraction 
queries from the TC. As we can see in Table 1, each 
query takes the OBU approximately 175 milliseconds, 
independent of the number of segments. If the TC makes 
a small, fixed number of queries, say ten, for each vehicle, 
then the OBU will spend only a few seconds in the Audit 
protocol each month. 


TC computational costs. In the course of the Audit 
protocol, the TC has to perform a number of complex 
calculations. In particular, the cost of challenging the 
OBU for each camera is proportional to the number of 
segments the OBU reported driving. 

To obtain our performance numbers for the audit pro- 
tocol, we considered the driving habits of an average 
American, both in terms of time spent and distance driven. 
For time, we assumed that an average user would have a 
commute of 30 minutes each way, meaning one hour each 
day, in addition to driving between two and three hours 
each weekend. For distance, we assumed that an average 
user would drive around 1,000 miles each month. While 
we realize that these averages will vary greatly between 
locations (for example, between a city and a rural area), 
we believe that these measures still give us a relatively 
realistic setting in which to consider our system. 

Table 2 gives the time it takes for the TC to challenge 
the OBU on a single segment for several segment lengths 
and time steps; we can see that the time taken grows 
approximately linearly with the number of segments. 
To determine the number of segments, we considered 


they are essentially the same as they were within PrETP; the numbers
they provide should therefore give a reasonably accurate estimate
for the cost of the TSP within our system as well.
both fine-grained and coarse-grained approaches. For the
fine-grained approach, we considered a time step of one 
minute. Using our assumptions about driving habits, this 
means that in a 30-day month with 22 weekdays, our 
average user will drive approximately 1,320 segments. 
Adding on an extra 680 segments for weekends, we can 
see that a user might accumulate up to 2,000 segments in 
a month. In the way that road prices are currently decided, 
however, a time step of one minute seems overly short, as 
typically there are only two times considered throughout 
the day: peak and off-peak. We therefore considered next 
a time step of one hour, keeping our segment length at 
1 mile. Here the number of miles driven determines the 
number of segments rather than the minutes spent in the 
car, and so we end up with approximately 1,000 segments 
for the month. Finally, we considered a segment length 
of 2 miles, keeping our time step at one hour; we can see 
that this results in approximately half as many segments 
as before, around 500 segments. Longer average physical 
segment lengths would result in an even lower number of 
segments (and therefore better performance). 
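
The segment counts above, and the per-segment cost they imply for the TC, can be reproduced with a few lines of arithmetic. The sketch below is only a restatement of the numbers in the text and in Table 2; the variable and function names are ours, and the 680-segment weekend allowance is taken directly from the text rather than derived.

```python
# Segment counts for the three parameter settings discussed above, and the
# per-segment TC time implied by Table 2. All names are illustrative.

WEEKDAYS = 22
WEEKDAY_MINUTES = 60        # 30-minute commute each way
WEEKEND_SEGMENTS = 680      # weekend allowance used in the text
MILES_PER_MONTH = 1000

# Fine-grained: 1-minute time steps, so minutes driven determine segments.
fine_grained = WEEKDAYS * WEEKDAY_MINUTES + WEEKEND_SEGMENTS
# Coarse-grained: 1-hour time steps, so miles driven determine segments.
one_mile = MILES_PER_MONTH // 1
two_mile = MILES_PER_MONTH // 2

# (segments, seconds for one spot check) as reported in Table 2.
TABLE_2 = {
    "1 mile, 1 minute": (2000, 55.68),
    "1 mile, 1 hour": (1000, 33.51),
    "2 miles, 1 hour": (500, 10.45),
}


def ms_per_segment(segments, seconds):
    """Average TC time per segment, in milliseconds."""
    return 1000 * seconds / segments


print(fine_grained, one_mile, two_mile)
for label, (segments, seconds) in TABLE_2.items():
    print(f"{label}: {ms_per_segment(segments, seconds):.1f} ms per segment")
```

The per-segment time stays within the same few tens of milliseconds across the three settings, which is what "approximately linear" means here.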


Communication overhead. Looking at Table 3, we 
can see that the size of a payment message is approxi- 
mately 6kB per segment; furthermore, this size is domi- 
nated by the NIZK (recall that each segment requires a 
commitment, a NIZK, and a ciphertext), which accounts 
for over 90% of the total size. For our parameter choices 
in Table 2, this would result in a total payment size of 
approximately 11MB in the worst case (with 2,000 segments)
and 3MB in the best case (with 500 segments).
In PrETP, on the other hand, the authors claim to have 
sizes of only 1.5kB per segment [4, Section 4.3]. Using 
their more compact segments with our ciphertexts added 
on would therefore result in a segment size of only 2kB, 
which means the worst-case size of the entire payment 
message would be under 4MB (and the best-case size 
approximately 1MB).

Finally, we can see that the overhead for the rest of the 
Audit protocol is quite small: each blind IBE key sent 
from the OBU to the TC is only 494 bytes; if the TC 
makes ten queries per audit, then the total data transferred 
in the course of the protocol is about 5kB. 
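
The communication figures above follow directly from the per-object sizes in Table 3. The sketch below redoes that arithmetic; the names are ours, and the conversion to MiB (2^20 bytes) is our assumption about the unit behind the rounded figures in the text.

```python
# Communication overhead implied by Table 3. Names are illustrative; the
# byte counts are the reported ones, and MiB is an assumed unit.

NIZK_BYTES = 5455
PAY_SEGMENT_BYTES = 5955     # reported total for one Pay segment
AUDIT_KEY_BYTES = 494        # one blind IBE key sent during Audit


def payment_bytes(segments):
    """Total size of one month's payment message."""
    return segments * PAY_SEGMENT_BYTES


def audit_bytes(queries):
    """Total size of the blind IBE keys sent during one audit."""
    return queries * AUDIT_KEY_BYTES


def to_mib(num_bytes):
    return num_bytes / 2**20


nizk_share = NIZK_BYTES / PAY_SEGMENT_BYTES

print(f"worst case: {to_mib(payment_bytes(2000)):.1f} MiB")
print(f"best case:  {to_mib(payment_bytes(500)):.1f} MiB")
print(f"NIZK share of a segment: {nizk_share:.1%}")
print(f"audit keys for ten queries: {audit_bytes(10)} bytes")
```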


5.1 Milo cost analysis


If we continue to assume that the TC always queries the
user on ten cameras, then the entire auditing process will
take less than 10 minutes per user in the worst case (when
there are 2,000 segments) and less than 2 minutes in the
best case (when there are 500 segments). If we price
computational resources according to Amazon EC2 [3], then
approximately matching the computer used for our
benchmarks would cost about 10 cents per hour. Between 6
and 30 users can be audited within an hour, so each user
ends up costing the system between one-third of a cent and
2 cents each month; this is an amount that the TSP could
easily charge the users if need be (although the cost would
presumably be lower if the TC simply performed the
computations itself). We therefore believe that the amount
of computation required to perform the audits, in addition
to being necessary for guaranteeing fairness and honesty
within the system, is reasonably practical.

Finally, to examine how much Milo would cost if deployed
in a real population, we consider the county of San Diego,
which consists of 3 million people possessing approximately
1.8 million vehicles, and almost 2,800 miles of roads
[16, 17, 44]. As we just saw, Milo has a computational cost
of up to 2 cents per user per month, which means a
worst-case expected annual cost of $432,000; in the best
case, wherein users cost only one-third of a cent per
month, the expected annual cost is only $72,000. In the
next section, we see how these costs compare to that of
the “naive” solution to collusion protection; i.e., one in
which we attempt to protect against driver collusion
through placement of cameras as opposed to prevention and
protection at the system level.


Length    Time step   Segments   Time for TC (s)
1 mile    1 minute    2000       55.68
1 mile    1 hour      1000       33.51
2 miles   1 hour      500        10.45


Table 2: The average time, in seconds and over a run of 10,
for the TC to perform a single spot check given segment
lengths and time steps; we consider only the active time
spent and not the time waiting for the OBU. Essentially all
of the time was spent iterating over the segments; as such,
the time taken grows approximately linearly with the number
of segments. To determine the approximate number of
segments given segment lengths and time steps, we assumed
that an average user would drive for 1,000 miles in a
30-day month, or about 33 hours (1 hour each weekday and an
extra 11 hours over four weekends).


Object              Size (B)
NIZK                5455
Commitment          130
Ciphertext          366
Total Pay segment   5955
Audit message       494


Table 3: Size of each of the components that needs to
be sent between the OBU and the TC, in bytes. Each
segment of the payment consists of a NIZK, commitment,
and ciphertext; all the segments are forwarded to the TC
from the TSP at the start of an audit. In the course of the
Audit protocol the OBU must also send blind IBE keys to
the TC.


6 Collusion Resistance


Previously proposed tolling systems did not take collusion
into account, as they allow the auditing authority to
transmit camera locations in the clear to drivers. Given
these locations, colluding drivers can then share their
audit transcripts each month in order to learn a greater
number of camera locations than they would have learned
alone. Furthermore, websites already exist which record
locations of red light cameras [38] and speed cameras [37];
one can easily imagine websites similar to these that
collect crowd-based reports of audit camera locations. With
cameras whose locations are fixed from month to month, the
cost to cheat is therefore essentially zero (just check the
website!) and so we can and should expect enterprising 
drivers to take advantage of the system. In contrast, Milo 
is specifically designed to prevent these sorts of trivial 
collusion attacks. 


In addition to learning camera locations in the course
of the audit phase, drivers may also learn them simply by
spotting cameras on the road. This is also quite damaging
to the system: after pooling together the various locations
and times at which they saw cameras, cheating drivers can
fix up their driving records in time to pass any
end-of-month audit protocol.


To prevent such cheating, a system could instead re- 
quire the OBU to transmit the tuples corresponding to 
segments as they are driven, rather than all together at 
the end of the month. Without an anonymizing service 
such as Tor (used in VPriv [39]), transmitting data while 
driving represents too great a privacy loss, as the TSP 
can easily determine when and for how long each driver 
is using their car. One possible fix might seem to be to 
continually transmit dummy segments while the car is 
not in use; transmitting segments in real time over a cel- 
lular network, however, leaks coarse-grained real-time 
location information to nearby cell towers (for example, 
staying connected to a single tower for many hours sug- 
gests that you are stationary), thus defeating the main goal 
of preserving driver privacy. 


Finally, we note that there exists a class of expensive 
physical attacks targeting any real-world implementation 
of a camera-based audit protocol. For example, against 
fixed-location cameras, cheating drivers could disable 
their OBU for specific segments each month, revealing in- 
formation about those segments. Against mobile cameras, 
a driver could follow each audit vehicle and record its 
path, sharing with other cheating drivers as they go. One 
can imagine defenses against these attacks and even more 
fanciful attacks in response; these sorts of attacks quickly
become very expensive and impractical, however, and 
provide tell-tale signs of collusion (e.g., repeated cheat- 
ing, suspicious vehicles). We therefore do not provide a 
system-level defense against them. 


6.1 Collusion resistance cost analysis 


With Milo, we have modified the PrETP system to avoid 
leaking the locations of cameras as part of the audit pro- 
tocol. An alternative approach is to leave PrETP (or one 
of the other previously proposed solutions) in place and 
increase the number of audit cameras and their mobility, 
thus reducing the useful information leaked in audits even 
when drivers collude. Whereas deploying Milo would 
increase computational costs over PrETP, deploying the 
second solution would increase the operational costs as- 
sociated with running the mobile audit cameras. In this 
section, we compare the costs associated with the two 
solutions. Even with intentionally conservative estimates 
for the operating costs of mobile audit cameras, Milo ap- 
pears to be competitive for reasonable parameter settings; 
as Moore’s law makes computation less expensive, Milo 
will become more attractive still. 

Hardening previous tolling systems against trivial 
driver collusion is possible if we consider using continu- 
ously moving, invisible cameras. Intuitively, if cameras 
move randomly, then knowing the position and time at 
which one audit camera was seen does not allow other 
cheating drivers to predict any future camera locations. 
The easiest way to achieve these random spot checks is to 
mount cameras on special-purpose vehicles, which then 
perform a random walk across all streets in the audit area. 
Even this will not generate truly random checks (as cars 
must travel linearly through streets and obey traffic laws); 
for ease of analysis we assume it does. Furthermore, we 
will make the assumptions that the audit vehicles will 
never check the same segment simultaneously, operate 
24 hours a day (every day), and are indistinguishable from 
other cars; tolling segments are 1 mile; and non-audit vehi- 
cles drive all road segments with equal probability. These 
assumptions are by no means realistic, but they present 
a stronger case for moving cameras and so we use them, 
keeping in mind that any more realistic deployment will 
have higher cost. 

Using a probability analysis similar to that of VPriv [39, 
Section 8.4], we consider an area with M miles of road 
and C audit vehicles. If both audit vehicles and other 
drivers are driving all roads uniformly at random, then 
a driver will share a segment with an audit vehicle with 
probability $p = C/M$ with each mile driven. If the driver
travels m miles in a tolling period, she will be seen at least
once by an audit vehicle with a probability of

$$1 - \left(1 - \frac{C}{M}\right)^{m}. \qquad (1)$$


Figure 2: A cost comparison of using the Milo system
against using mobile cameras within previously proposed 
systems. We know, from Section 5.1, that Milo has a 
worst-case computational cost of $432,000 per year and 
a best case of $72,000; for the other systems, we ignore 
computation completely (i.e., we assume it is free). Even 
with the minimal costs we have assigned to operating 
a fleet of audit vehicles 24 hours a day and assuming 
worst-case computational costs, Milo becomes equally 
cheap when the probability of catching cheating drivers is 
83%, and becomes significantly cheaper as the probability 
approaches 100%. For Milo’s best-case cost, it becomes 
cheaper as soon as more than one camera is used. 


To determine the overall cost of this type of operation, 
we return to San Diego County (discussed already in Sec- 
tion 5.1); recall that it consists of 1.8 million vehicles driv- 
ing on 2,800 miles of road, in which the average distance 
driven by one vehicle is 1,000 miles in a month. Using 
Equation 1, with one audit vehicle (C = 1), the probability 
that a driver gets caught is $1 - (2799/2800)^{1000} \approx 0.3$, so
that a potentially cheating driver still has a 70% chance 
of completely avoiding any audit vehicles for a month. If 
we use two audit vehicles, then this number drops to 49%. 
Continuing in this vein, we need 13 audit vehicles to guar- 
antee a 99% chance of catching drivers who intentionally 
omit segments. Achieving these results requires the TC 
to employ drivers 24 hours a day, as well as purchase, 
maintain, and operate a fleet of audit vehicles. To con- 
sider the cost of doing so, we estimate the depreciation, 
maintenance, and operation cost of a single vehicle to be 
approximately $12,500 a year [45]. Furthermore, Cali- 
fornia has a minimum wage of $8.00/hr; paying this to 
operate a single vehicle results in minimum annual salary 
costs of $70,080, ignoring all overtime pay and benefits. 
Each audit vehicle will therefore cost at least $82,500 per 
year (ignoring a number of additional overhead costs). 
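
The figures in this paragraph can be reproduced from the detection probability used above, 1 - (1 - C/M)^m, with M = 2,800 and m = 1,000. The sketch below is only a restatement of that arithmetic; the function and constant names are ours.

```python
# Detection probability for C audit vehicles on M miles of road when a
# driver covers m miles, plus the per-vehicle cost estimate from the text.
# All names are illustrative.

ROAD_MILES = 2800        # M: miles of road in San Diego County
MILES_DRIVEN = 1000      # m: miles driven per tolling period

HOURLY_WAGE = 8.00       # California minimum wage used in the text
VEHICLE_UPKEEP = 12_500  # depreciation, maintenance, operation per year


def detection_probability(cameras, road_miles=ROAD_MILES,
                          miles_driven=MILES_DRIVEN):
    """Probability of being seen at least once by some audit vehicle."""
    return 1 - (1 - cameras / road_miles) ** miles_driven


def cameras_needed(target):
    """Smallest number of audit vehicles reaching the target probability."""
    cameras = 1
    while detection_probability(cameras) < target:
        cameras += 1
    return cameras


annual_wages = HOURLY_WAGE * 24 * 365           # one vehicle, around the clock
annual_vehicle_cost = annual_wages + VEHICLE_UPKEEP

print(f"one vehicle:  {detection_probability(1):.2f}")
print(f"two vehicles: {1 - detection_probability(2):.2f} chance of escape")
print(f"vehicles for 99%: {cameras_needed(0.99)}")
print(f"annual cost per vehicle: ${annual_vehicle_cost:,.0f}")
```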

Finally, we compare the cost of operating these mobile 
cameras with the cost of the Milo system. Because Milo 
leaks no information about camera locations to drivers, 
cameras can in fact stay at fixed locations; as long as they
are virtually invisible, drivers have no opportunities to 
learn their locations and so there is no need to move them 
continuously. We therefore consider placing invisible
cameras at random fixed locations, and can calculate the 
probability of drivers being caught by Milo using Equa- 
tion 1, where we now use C to represent the number of 
cameras (and continue to assume that drivers drive 1,000 
miles at random each month). 

Figure 2 compares the cost of Milo with fixed cameras 
and the cost of previous systems with mobile cameras 
as the probability of detecting cheating increases. We 
used a per-camera annual cost of $10,000.⁶ As we can
see, in the worst case, Milo achieves cost parity with 
mobile cameras at a detection probability of 83% and 
becomes vastly cheaper as the systems approach complete 
coverage, while in the best case it achieves cost parity 
as soon as more than a single camera is used (which 
gives a detection probability of around 30%). In either
case, recall that our assumptions about the cost of
operating these vehicles significantly underestimate
the actual cost; substituting in more realistic
numbers would thus cause Milo to compare even more 
favorably. In addition, future developments in computing 
technology are almost guaranteed to drive down the cost 
of computation, while fuel and personnel costs are not 
likely to decrease, let alone as quickly. Therefore, we 
believe that Milo is and will continue to be an effective
(and ultimately cost-effective) solution to protect against
driver collusion. 
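
The cost figures being compared can be recomputed from the numbers stated earlier: 10 cents per machine-hour, audits of 2 to 10 minutes per user, 1.8 million vehicles, and at least $82,500 per audit vehicle per year. The sketch below does so with exact fractions. The names are ours, and treating the per-camera cost as common to both systems (so that only computation and vehicle operation differ) is our reading of the footnote on camera costs, not necessarily the exact accounting behind Figure 2.

```python
from fractions import Fraction

# Annual cost figures used in the comparison. Names are illustrative, and
# the differential-cost view is an assumption stated in the lead-in.

VEHICLES = 1_800_000
EC2_CENTS_PER_HOUR = 10
AUDIT_VEHICLE_DOLLARS = 82_500    # lower bound per audit vehicle per year


def cents_per_user(audit_minutes):
    """Monthly computation cost of auditing one user, in cents."""
    users_per_hour = Fraction(60, audit_minutes)
    return EC2_CENTS_PER_HOUR / users_per_hour


def annual_compute_dollars(cents_per_user_month):
    """Annual computation cost of Milo for the whole population."""
    return VEHICLES * 12 * Fraction(cents_per_user_month) / 100


def fleet_dollars(audit_vehicles):
    """Annual operating cost of a fleet of mobile audit vehicles."""
    return audit_vehicles * AUDIT_VEHICLE_DOLLARS


print(float(cents_per_user(10)))   # about 1.67, rounded up to 2 in the text
print(float(cents_per_user(2)))    # one-third of a cent
print(annual_compute_dollars(2), annual_compute_dollars(Fraction(1, 3)))
for c in (1, 5, 13):
    print(c, "audit vehicles:", fleet_dollars(c))
```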


7 Related work 


The study of privacy-preserving traffic enforcement and 
toll collection was initiated in papers by Blumberg, Keeler, 
and shelat [8] and Blumberg and Chase [7]. The former of 
these papers gave a system for traffic enforcement (such 
as red-light violations) and uses a private set-intersection 
protocol at its core; the latter gave a system for tolling 
and road pricing, and uses general secure function evalu- 
ation. Neither system keeps the location of enforcement 
or spot-check devices secret from drivers. In an impor- 
tant additional contribution, these papers formalized the 
“implicit privacy” that drivers currently enjoy: The police 
could tail particular cars to observe their whereabouts, but 
it would be impractical to apply such surveillance to more 
than a small fraction of all drivers.⁷


⁶This number was loosely chosen based upon purchase costs for
red light violation cameras. Note that the choice does not affect the 
differential system cost, as both systems must operate the same number 
of cameras to achieve a given probability of success. 

⁷We would like to correct one misconception, lest it influence future
researchers. Blumberg, Keeler, and shelat write, “the standards of 
suspicion necessary to stop and search a vehicle are much more lax 
than those required to enter and search a private residence.” In the 
U.S., the same standard — probable cause — governs searches of both 
vehicles and residences; the difference is only that a warrant is not 
required before the search of a car, as “it is not practicable to secure a 
warrant because the vehicle can be quickly moved out of the locality 
or jurisdiction in which the warrant must be sought” (Carroll v. United 
Another approach to privacy-preserving road pricing
was given by Troncoso et al. [43], who proposed trusted 
tamper-resistant hardware in each car that calculates the 
required payment, and whose behavior can be audited by 
the car’s owner. The Troncoso et al. paper also includes 
a useful survey of pay-as-you-drive systems deployed at 
the time of its publication. See Balasch, Verbauwhede, 
and Preneel [5] for a prototype implementation of the 
Troncoso et al. approach. 

De Jonge and Jacobs [19] proposed a privacy-preserving
tolling system in which drivers commit to the
path they drove without revealing the individual road seg- 
ments. De Jonge and Jacobs’ system uses hash functions 
for commitments, making it very efficient. Only additive 
road pricing functions are allowed (i.e., ones for which 
the cost of driving along a path is the sum of the cost 
of driving along each segment of the path); this makes 
possible a protocol for verifying that the total fee was 
correctly computed as the sum of each road segment price 
by revealing, essentially, a path from the root to a single 
leaf in a Merkle hash tree. (This constitutes a small infor- 
mation leak.) In addition, de Jonge and Jacobs use spot 
checks to verify that the driver faithfully reported each 
road segment on which she drove. 

More recently, Popa, Balakrishnan, and Blumberg pro- 
posed the VPriv [39] privacy-preserving toll collection 
system. VPriv takes advantage of the additive pricing 
functions it supports to enable the use of homomorphic 
commitments whereby the drivers commit to the prices 
for each segment of their path as well as the sum of the 
prices. Then, the product of the commitments is a commitment
to the sum of the prices. This eliminates the need
for a protocol to verify that the sum of segment prices 
was computed correctly. Like previous systems, VPriv 
uses (camera) spot checks to ensure that drivers faithfully 
reveal the segments they drove. The downside to VPriv 
is that, for the audit protocol, drivers must upload the 
road segments they drove to the server; to avoid linking 
these to their IP address, they must use an anonymizing 
network such as Tor. 
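
To make the homomorphic property concrete, the following toy sketch uses Pedersen-style commitments, in which multiplying commitments yields a commitment to the sum of the committed values. This is an assumption chosen for illustration: the text above does not say which commitment scheme VPriv uses, the parameters are far too small to be secure, and all names are ours.

```python
# Toy Pedersen-style commitments showing the additive homomorphism.
# Assumed scheme and insecure toy parameters, for illustration only.

P = 23   # safe prime, P = 2*Q + 1
Q = 11   # prime order of the subgroup of quadratic residues mod P
G = 4    # generator of that subgroup
H = 9    # second generator; in practice log_G(H) must be unknown


def commit(value, randomness):
    """Commit to `value` using blinding factor `randomness`."""
    return (pow(G, value, P) * pow(H, randomness, P)) % P


prices = [3, 5, 2]    # per-segment prices (hypothetical)
blinds = [7, 1, 4]    # per-segment blinding factors (hypothetical)

commitments = [commit(v, r) for v, r in zip(prices, blinds)]

product = 1
for c in commitments:
    product = (product * c) % P

# Opening the product with the summed price and summed blinding factor.
commitment_to_sum = commit(sum(prices), sum(blinds))

print(product == commitment_to_sum)
```

Because the check uses only the product of the published commitments, the verifier learns the total fee without learning any individual segment price.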

Balasch et al. proposed PrETP [4] to address some 
of the shortcomings in VPriv. In PrETP, drivers do not 
reveal the road segments they drove in the clear, and so do 
not need an anonymizing network. Instead, they commit 
to the segments and, using a homomorphic commitment 
scheme, to the corresponding fees; in the audit protocol, 
they open the commitments corresponding to the road 
segments on which spot-check cameras observed them. 

In each of the systems of de Jonge and Jacobs, VPriv,
and PrETP, drivers are challenged in the audit protocol 
to prove that they committed to or otherwise uploaded 
the segments for which there is photographic evidence 


States, 267 U.S. 132 (1925), at 153). 
that they were present. As discussed in Sections 1 and 2,
this revealing of camera locations enables several attacks 
which allow drivers to pay less than their actual tolls. 
Additionally, camera placement and tolling areas must be 
restricted to ensure driver privacy, for example, by using 
“virtual trip lines” [30]. 

In recent work, Hoepman and Huitema [29] observed 
that in both VPriv and PrETP the audit protocol allows 
the government to query cars about locations where there 
was no camera, a capability that could be misused, for 
example, to identify whistleblowers. They propose a 
privacy-preserving tolling system in which vehicles can 
be spot-checked only where their presence was actually 
recorded, and in which overall driver privacy is guar- 
anteed so long as the pricing provider and aggregation 
provider do not collaborate. Like VPriv, Hoepman and 
Huitema’s system requires road segments to be transmit- 
ted from the car to the authority over an anonymizing 
network. 

Besides tolling, there are other vehicular applications 
that require privacy guarantees; see, generally, Hubaux, 
Capkun, and Luo [31]. One important application is 
vehicle-to-vehicle ad hoc safety networks [14]; see Freudi- 
ger et al. [21] for one approach to location privacy in such 
networks. Another important application is aggregate traf- 
fic data collection. Hoh et al. [30] propose “virtual trip 
lines” that instruct cars to transmit their location infor- 
mation and are placed to minimize privacy implications; 
Rass et al. [40] give an alternative construction based on 
cryptographic pseudonym systems. 

Vehicle communication is one class of ubiquitous com- 
puting system. Location privacy in ubiquitous computing 
generally is a large and important research area; see the 
recent survey by Krumm [33] for references. 


8 Conclusions 


In recent years, privacy-preserving toll collection has been 
proposed as a way to implement more fine-grained pricing 
systems without having to sacrifice the privacy of drivers 
using the roads. In such systems drivers do not reveal 
their locations directly to the toll collection authorities; 
this means there needs to be a mechanism in place to guar- 
antee that the drivers are still reporting their accumulated 
fees accurately and honestly. Maintaining this balance 
between privacy and honesty in an efficient and practical 
way has proved to be a challenging problem; previous 
work, however, such as the VPriv and PrETP systems, has 
demonstrated that this problem is in fact tractable. Both 
these systems employ modern cryptographic primitives 
to allow the driver to convince the collection authority of 
the accuracy of her payment without revealing any part of 
her driving history. To go along with this collection mech- 
anism, a series of random spot checks (i.e., the authority 
challenging the driver to prove that she paid for segments 
in which she was caught on camera) must be performed
in order to maintain honesty and fairness throughout the 
system. 

In this paper, we have identified large-scale driver col- 
lusion as a realistic and damaging threat to the success 
of privacy-preserving toll collection systems. To protect 
against these sorts of collusions, we have presented Milo, 
a system which achieves the same privacy properties as 
VPriv and PrETP, but strengthens the guarantee of driver 
honesty by avoiding revealing camera locations to drivers. 
We have also implemented the new parts of our system to 
show that achieving this stronger security guarantee does 
not add an impractical burden to any party acting within 
the system. 

Finally, along more practical lines, we have considered
a naive approach to protecting against collusions and
shown that, in terms of both cost and effectiveness,
it is ultimately less desirable and more cumbersome
than Milo.

The weaknesses we identify in previous systems are 
caused by the gap between the assumption made by 
the cryptographic protocols (that spot checks are unpre- 
dictable) and the real-world cameras used to implement 
them — cameras that are physical objects that can be iden- 
tified and may be difficult to move. If drivers are able to 
avoid some cameras, more of them will be required; if 
too many spot-check cameras are deployed, the records 
they generate will themselves degrade driver privacy. We 
believe that it is important for work on privacy-preserving 
tolling to address this limitation by carefully considering 
how the spot checks it relies on will be implemented. 
Abstract


Anonymizing private data before release is not enough 
to reliably protect privacy, as Netflix and AOL have 
learned to their cost. Recent research on differential 
privacy opens a way to obtain robust, provable privacy 
guarantees, and systems like PINQ and Airavat now of- 
fer convenient frameworks for processing arbitrary user- 
specified queries in a differentially private way. How- 
ever, these systems are vulnerable to a variety of covert- 
channel attacks that can be exploited by an adversarial 
querier. 

We describe several different kinds of attacks, all fea- 
sible in PINQ and some in Airavat. We discuss the space 
of possible countermeasures, and we present a detailed 
design for one specific solution, based on a new primi- 
tive we call predictable transactions and a simple differ- 
entially private programming language. Our evaluation, 
which relies on a proof-of-concept implementation based
on the Caml Light runtime, shows that our design is ef- 
fective against remotely exploitable covert channels, at 
the expense of a higher query completion time. 


1 Introduction 


Privacy is a problem. Vast amounts of data about individuals
are constantly accumulating in various databases—
patient records, content and link graphs of social net- 
works, mobility traces in cellular networks, book and 
movie ratings, etc.—and there are many socially valu- 
able uses to which it can potentially be put. But, as Net- 
flix and others have discovered [3,22], even when data 
collectors try to protect the privacy of their customers by 
releasing anonymized or aggregated data, this data often 
reveals much more than intended, especially when it is 
combined with other data sources. To reliably prevent 
such privacy violations, we need to replace current ad- 
hoc solutions with a principled data release mechanism 
that offers strong, provable privacy guarantees. 

Recent research on differential privacy [8-10] has 
brought us a big step closer to achieving this goal. Differential
privacy allows us to reason formally about what
an adversary could learn from released data, while avoid- 
ing many assumptions (e.g., what exactly the adversary 
might try to learn, or what he or she might already know) 
that have been the cause of privacy violations in the past. 
Early work on differentially private data analysis relied 
on manual proofs by privacy experts that the answers to 
particular queries were safe to release [21]; today, sys- 
tems like PINQ [20] and Airavat [26] can perform dif- 
ferentially private data analysis automatically, without 
needing a human expert in the loop. 

Airavat and PINQ go beyond just certifying queries by 
the data owner as differentially private; they are explic- 
itly designed to support untrusted queries over private 
databases. In this model, a third party is permitted to 
submit arbitrary queries over the database, but the data 
owner imposes a “privacy budget” that limits the amount 
of information the third party can obtain about any indi- 
vidual whose data is in the database. The system ana- 
lyzes each new query to determine its potential “privacy 
cost” and allows it to run only if the remaining balance 
on the privacy budget is sufficiently high. This mode of 
operation is attractive for many scenarios; for example, 
Netflix could give researchers access to its database of 
movie ratings via such a query interface and still give 
strong privacy assurances to customers. An adversarial 
querier could not, for instance, obtain an accurate answer 
to the query “Has John Doe watched any adult movies?”
because the cost of such a query would exceed any rea- 
sonable privacy budget. 

However, Airavat and PINQ both contain vulnerabilities
that can be exploited by an adversary to extract private
information through covert channels.¹ The reason is
that these systems rely on the assumption that the querier
can observe only the result of the query, and nothing else.
In practice, however, the querier is also able to observe
other effects of his query, such as the time it takes to
complete. Such observations can be exploited to mount a
covert-channel attack. To continue with our earlier example,
the adversary might run a query that always returns
zero as its result but that takes one hour to complete
if John Doe has watched adult movies and less than
a second otherwise. Both Airavat and PINQ would consider
the output of such a query to be safe because it does
not depend on the contents of the private database at all.
However, the adversary can still learn with perfect certainty
whether John Doe has watched adult movies, a
blatant violation of differential privacy. PINQ’s prototype
implementation also permits global variables to be
used as covert channels to leak private information during
query execution.

¹ The designers of these systems were aware of these covert
channels, and each addresses them to some extent. See
Sections 3.5 and 3.6.

Covert channels have plagued computer systems for 
many years [1, 2, 15, 16, 18, 27, 30, etc.], and they are no-
toriously difficult to avoid [7]. However, they are partic-
ularly devastating in a system that is designed to enforce 
differential privacy: if a channel allows the adversary to 
learn even a single bit of private information, the differ- 
ential privacy guarantees are already broken! Thus, dif- 
ferential privacy puts particularly high demands on a de- 
fense against covert channels; merely limiting the band- 
width of the channels is not enough. 

Fortunately, the untrusted-query scenario has two fea- 
tures that make a solution feasible. First, there is no need 
to allow the querier direct access to the machine that 
hosts the database; he can be forced to submit queries 
and receive results over the network. This rules out diffi- 
cult channels such as power consumption [17] and elec-
tromagnetic radiation [13, 24], essentially leaving the ad-
versary with just two channels: the privacy budget and
the query completion time. 

Our key insight is that, in this specific scenario, these 
two channels can be closed completely through a com- 
bination of two techniques. The budget channel can be 
closed by using program analysis to statically determine 
the privacy cost of each query. Thus, the deduction from 
the privacy budget is independent of the database con- 
tents. The external timing channel can be closed by a) 
breaking each query into “microqueries” that operate on 
a single database row at a time, and by b) enforcing that 
each microquery takes a fixed amount of time. (If nec- 
essary, the microquery is aborted and a default value is 
returned. In the context of differential privacy, this is 
safe—and does not open another channel—because the 
privacy cost of the default values is already included in 
the privacy cost of the query.) Thus, we can obtain strong 
privacy assurances even if the adversary can pose arbi- 
trary queries and can observe all the (remotely measur- 
able) channels that are possible in our model. 

We present the design of Fuzz, a system that imple- 
ments this defense. Fuzz uses a novel type system [25] 
to statically infer the privacy cost of arbitrary queries
written in a special programming language, and it uses
a novel primitive called predictable transactions to en- 
sure that a potentially adversarial computation completes 
within a specific time or returns a default value. We have 
built and evaluated a proof-of-concept implementation of 
Fuzz based on the Caml Light runtime system [5, 19]. 
Our results show that Fuzz effectively closes all known 
remotely exploitable channels, at the expense of a higher 
query completion time. 

Implementing predictable transactions is challenging 
in practice: Fuzz must be able to abort an arbitrary and 
potentially adversarial computation by a specified dead- 
line, even if the adversary is actively trying to cause the 
deadline to be missed, and must ensure that—whether 
or not the computation is aborted—it leaves no linger- 
ing traces that can measurably affect the program’s over- 
all execution time (garbage in the heap, VM pages that 
must later be freed by the OS, etc.). Nevertheless, we
show that, across a variety of adversarial queries that ex- 
ploit different attack strategies, our implementation ex- 
hibits extremely small variation in completion time—on 
the order of the time required to handle a single timer 
interrupt. This variation is so small that it is difficult to 
measure even on the machine itself. Thus, it would be 
useless to a remote attacker, who would have to measure 
it across a wide-area network using the limited number 
of trials that the privacy budget permits. 

In summary, we make the following contributions: 


1. a detailed analysis of several classes of covert-channel
   attacks and a discussion of which are feasible in PINQ
   and Airavat (Section 3);

2. an analysis of the space of potential solutions (Section 4);

3. a concrete design for one specific solution, based on
   default values and predictable transactions (Sections 5
   and 6);

4. a proof-of-concept implementation of our design
   (Section 7); and

5. an experimental evaluation (Section 8).


We close with a discussion of related work and a few 
concluding thoughts. 


2 Background 


Before describing our attacks and the Fuzz design and 
implementation, we briefly review some technical back- 
ground on differential privacy, function sensitivity, and 
differentially private programming languages. 


2.1 Differential privacy 


Differential privacy [8] is a property of randomized func- 
tions that take a database as input and return a result that 
is typically some form of aggregate (a real number rep-
resenting a count; a histogram; etc.). The database (db)
is a collection of “rows,” one for each individual whose
privacy we mean to protect. 

Informally, a randomized function is differentially pri- 
vate if arbitrary changes to a single individual’s row 
(keeping other rows constant) result in only statistically 
insignificant changes in the function’s output distribu- 
tion; thus, any individual’s presence in the database has a 
statistically negligible effect. Formally [12], differential 
privacy is parametrized by a real number $\epsilon$, corresponding
to the strength of the privacy guarantee: smaller values of
$\epsilon$ yield more privacy. Two databases $b$ and $b'$ are
considered similar, written $b \sim b'$, if they differ in only one
row. We then say that a randomized function $q : db \to \mathbb{R}$
is $\epsilon$-differentially private if, for all possible sets of
outputs $S \subseteq \mathbb{R}$, and for all similar databases $b, b'$, we have
$\Pr[q(b) \in S] \le e^{\epsilon} \cdot \Pr[q(b') \in S]$. That is, when the
input database is changed in one row, there is at most a very
small multiplicative difference ($e^{\epsilon}$) in the probability of
any set of outcomes $S$.

Methods for achieving differential privacy can be at- 
tractively simple—e.g., perturbing the true answer to a 
numeric query with carefully calibrated random noise. 
For example, the query “How many patients at this hos- 
pital are over the age of 40?” is intuitively “almost safe”: 
safe because it aggregates many individuals’ information 
together, but only “almost” because, if an adversary hap- 
pened to know the ages of every patient except John Doe, 
then answering this query exactly would give him certain 
knowledge of a fact about John. The differential privacy 
methodology rests on the observation that, if we add a 
small amount of random noise to this query’s result, we 
still get a useful estimate of the true answer while ob- 
scuring the age of any single individual. By contrast, the 
query “How many patients named John Doe are over the 
age of 40?” is plainly problematic, since the answer is
very sensitive to the presence or absence of a single indi- 
vidual. Such a query cannot usefully be privatized: if we 
add enough noise to mostly obscure the contribution of 
John Doe’s age, there will be essentially no signal left. 
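
The noise-addition idea can be illustrated with a minimal sketch. It assumes the Laplace mechanism, a standard way to privatize a counting query that the text above does not name explicitly; the function names, the `ages` data, and the chosen value of epsilon are all hypothetical.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def noisy_count(rows, predicate, epsilon, rng):
    """Count the matching rows, then add noise calibrated to sensitivity 1."""
    true_count = sum(1 for row in rows if predicate(row))
    return true_count + laplace_noise(1.0 / epsilon, rng)

# Hypothetical data: ages of patients at a hospital.
ages = [25, 41, 67, 39, 52]
rng = random.Random(0)
estimate = noisy_count(ages, lambda age: age > 40, epsilon=0.5, rng=rng)
print("noisy answer to 'how many patients are over 40?':", estimate)
```

Averaged over many runs the noisy answers centre on the true count, while any single answer does not pin down one patient's age.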


2.2 Compositionality and privacy budgets 


An important consequence of the definition of differ- 
ential privacy is that composing a differentially private 
function with any other function that does not, itself, de- 
pend on the database yields a function that is again dif- 
ferentially private—that is, no amount of postprocessing, 
even with unknown auxiliary information, can lessen the 
differential privacy guarantee. This allows us to reason 
about harmful effects of data release that might seem 
quite far removed from the function that is actually being 
computed. 

Another important property of differential privacy is 
that its guarantee degrades gracefully under repeated application:
a pair of $\epsilon$-differentially private functions
is always $2\epsilon$-differentially private, when taken together.
This allows us to think of having a fixed “privacy budget”
up front, which is slowly exhausted as queries are
answered: if our privacy budget is $\epsilon$, we may feel free to
independently answer $k$ queries, where the $i$-th query is
$\epsilon_i$-differentially private and $\sum_i \epsilon_i \le \epsilon$, without fear that the
aggregation of these $k$ queries will violate $\epsilon$-differential
privacy.
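
The budget accounting described above can be sketched as follows. This is an illustration only: the class name `PrivacyBudget` and its method are hypothetical, and the sketch assumes that the cost of each query is known before the query runs.

```python
class PrivacyBudget:
    """Tracks the remaining privacy budget for one database."""

    def __init__(self, epsilon):
        self.remaining = epsilon

    def try_charge(self, cost):
        """Deduct the cost if it fits; return whether the query may run."""
        if cost < 0:
            raise ValueError("privacy cost must be non-negative")
        if cost > self.remaining:
            return False
        self.remaining -= cost
        return True

# Total budget 1.0; three queries with costs 0.5, 0.25 and 0.5.
budget = PrivacyBudget(1.0)
decisions = [budget.try_charge(cost) for cost in (0.5, 0.25, 0.5)]
print(decisions, budget.remaining)
```

The third query is refused because answering it would push the sum of the individual costs past the total budget.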


2.3 Function sensitivity 


The central idea in proofs of differential privacy is to 
bound the sensitivity of queries to small changes in their 
inputs. Sensitivity is a kind of continuity property; a 
function of low sensitivity maps nearby inputs to nearby 
outputs. 

Sensitivity is relevant to differential privacy because 
the amount of noise required to make a deterministic 
query differentially private is proportional to its sensi- 
tivity. For example, the sensitivity of the two age queries 
discussed above is 1: adding or removing one patient’s 
records from the hospital database can change the true 
value of each query by at most 1. This means that we 
should add the same amount of noise to “How many pa- 
tients at this hospital are over the age of 40?” as to 
“How many patients named John Doe are over the age of 
40?” This may appear counter-intuitive, but it achieves 
the right goal: the privacy of single individuals is pro- 
tected to exactly the same degree in both cases. What 
differs is the usefulness of the results: knowing the an-
swer to the first query with, say, a typical error margin of
±100 could still be valuable if there are thousands of pa-
tients, whereas knowing the answer to the second query
(which can only be zero or one) ±100 is useless. We
might try making the second query more useful by scal-
ing its answer up numerically: “Is John Doe over 40? If
yes, then 1,000, else 0.” But this scaled query now has a
sensitivity of 1,000, not 1, and so 1,000 times the noise 
must be added, blocking our attempt to violate privacy. 
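
A short worked calculation shows why scaling does not help. It assumes Laplace noise whose scale $b$ equals the sensitivity $\Delta$ divided by $\epsilon$; the symbols $\Delta$ and $b$ are introduced here only for illustration and do not appear in the text above.

```latex
\begin{align*}
\text{original query:} \quad & \Delta = 1,    & b = \frac{1}{\epsilon},    \qquad & \frac{\text{gap between the two possible answers}}{b} = \frac{1}{1/\epsilon} = \epsilon \\
\text{scaled query:}   \quad & \Delta = 1000, & b = \frac{1000}{\epsilon}, \qquad & \frac{\text{gap between the two possible answers}}{b} = \frac{1000}{1000/\epsilon} = \epsilon
\end{align*}
```

Under this assumption the ratio between the gap and the noise scale is $\epsilon$ in both cases, so the scaled query is exactly as hard to read through the noise as the original one.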








2.4 Programming with privacy 


Early work on differential privacy has mostly focused 
on specific algorithms rather than general, compositional 
mechanisms: given a particular algorithm, we prove by 
hand that it is differentially private. Most of the time, this 
does not require much ingenuity—just applying known 
techniques—but even so, this approach doesn’t scale 
well because it demands that each new algorithm be cer- 
tified by a skilled, trusted human. A better approach is to 
automate this certification process with a programming 
language in which every well-typed program is guaran- 
teed to be differentially private. Then (untrusted) non- 
experts can write as many different algorithms as they 
like, and the database administrator can rely on the lan-
guage to ensure that privacy is not being violated.

Systems are beginning to be available that implement
such languages, notably Privacy Integrated Queries
(PINQ) [20] and Airavat [26]. PINQ is an embedded
extension of C# that tracks the privacy impact of a vari-
ety of relational algebra operations on database tables, as
well as certain forms of query composition. Airavat inte- 
grates differential privacy into a distributed, Java-based 
MapReduce framework. 


2.5 Processing model 


Although PINQ and Airavat differ in many particu- 
lars, they embody essentially the same basic process- 
ing model, which we also follow in the Fuzz system de- 
scribed below. A query in each of these systems can be 
viewed as consisting of one or more mapping operations 
that process individual records in the database, together 
with some reducing code that combines the results of 
the mapping operations without directly looking at the 
database. When a query is submitted, the system verifies
that it is $\epsilon_i$-differentially private, deducts $\epsilon_i$ from the total
privacy budget $\epsilon$ associated with the database, and (if $\epsilon$
remains nonzero) returns the query result. (Note that,
in this model, we account for the possibility of collu- 
sion between adversaries by associating the privacy bud- 
get with the database and not with individual queriers. 
Thus, once the budget is exhausted, we must throw away 
the database and never answer any more queries.) We 
call the mapping operations microqueries and the rest of 
the code the macroquery. 

Airavat implements a simple version of this model: 
a query consists of a sequence of chained microqueries 
(“mappers” in Airavat terminology) plus a selection from 
among a fixed set of macroqueries (“reducers”). The 
mappers are the only untrusted code: the reducers are 
part of the trusted base. When a query is submitted, 
the adversary must also declare the expected numerical 
range of its outputs, which amounts (since its input is 
a single record of the database) to stating its sensitiv- 
ity. If the actual output ever falls outside of the declared 
range, it is clipped—in essence, the declared sensitivity 
is enforced by the system. From the declared sensitivity, 
Airavat can calculate how much noise must be added to 
the reducer’s results to achieve $\epsilon$-differential privacy.
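
The clipping step can be sketched as below. This is not Airavat's code: all names are hypothetical, a summing reducer with Laplace noise is assumed, and the sensitivity bound is a deliberately conservative choice made for this illustration.

```python
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def clipped_sum(rows, mapper, lo, hi):
    """Apply the mapper to each row and clip every output into [lo, hi]."""
    return sum(min(max(mapper(row), lo), hi) for row in rows)

def noise_scale(lo, hi, epsilon):
    """Noise scale from a conservative sensitivity bound (an assumption)."""
    sensitivity = max(abs(lo), abs(hi), hi - lo)
    return sensitivity / epsilon

def noisy_clipped_sum(rows, mapper, lo, hi, epsilon, rng):
    """Sum the clipped mapper outputs and add noise scaled to the range."""
    scale = noise_scale(lo, hi, epsilon)
    return clipped_sum(rows, mapper, lo, hi) + laplace_noise(scale, rng)

rows = [1, 5, 1000]                      # last value exceeds the declared range
total = clipped_sum(rows, lambda r: r, 0, 10)
print(total, noise_scale(0, 10, 0.5))
print(noisy_clipped_sum(rows, lambda r: r, 0, 10, 0.5, random.Random(0)))
```

Because the out-of-range value is clipped to 10, a mapper cannot contribute more than its declared range allows, whatever it actually returns.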

In PINQ, macroqueries are written in LINQ, a SQL- 
like declarative language, which can be embedded in oth- 
erwise unconstrained C# programs. Microqueries can 
be general C# computations (optionally constrained by 
a checker method called Purify; see Section 3.5). 


3 Attacks on differential privacy 


Naturally, database administrators may be nervous about 
offering adversaries the opportunity to run arbitrary 
queries against their raw data. They will need strong 
assurances that such adversarial queries not only play
by the rules of differential privacy but also have no in-
direct means of improperly leaking private information 
about individuals in the database. Unfortunately, this is 
not currently the case: while the authors of both PINQ 
and Airavat have anticipated the possibility of covert- 
channel attacks and have implemented either a partial 
defense (Airavat) or hooks for adding one (PINQ), both 
systems remain vulnerable to a range of attacks, as we 
now demonstrate. 


3.1 Threat model 


It is well known that covert channels are essentially 
impossible to eliminate if we allow the adversary to 
run other processes on the same computer that runs the 
query. Even if these other processes have no access to 
the database and cannot communicate directly with the 
query process, there are just too many ways for the query 
process to perturb local conditions in ways that can be 
measured fairly accurately if the observer is this close— 
e.g., processor usage, disk activity, cache pollution, etc. 
However, if we assume that the adversary is on the other 
end of a network connection, we have a much better 
chance of success. This is fortunate, since the demands 
of the situation are very strong. It is not enough to limit 
leakage to a low bandwidth or a small number of bits: 
even one bit is too much if that bit is the answer to “Does
John Doe watch adult movies?”

We therefore assume that the database and associated 
query system are hosted on a private, secure machine. 
The adversary does not have physical access to this ma- 
chine or its immediate environment (so that there is no 
way to measure its power usage, etc.) and can only com- 
municate with it over a network. The adversary submits 
arbitrary queries to the system over the network. The 
system executes each query (if it determines that doing 
so is safe) and returns the answer over the network. The 
system also maintains a privacy budget for the database 
as a whole, and it refuses to answer any more queries 
once the budget is exhausted. 

This threat model is shared by all differentially private 
query systems (PINQ, Airavat, and our Fuzz system), 
and its assumptions seem reasonable in practice. Essen- 
tially, it gives the adversary three pieces of information: 
(1) the actual answer to their query (a number, histogram, 
etc.), if any, (2) the time that the response arrives on their 
end of the network connection, and (3) the system’s deci- 
sion whether to execute their query or refuse because do- 
ing so would exceed the available privacy budget. How- 
ever, this threat model still provides plenty of room for 
attacks on privacy. We will see that, unless appropriate 
steps are taken, both the decision whether or not to ex- 
ecute a query and the execution time itself can be used 
as channels to leak private information. In essence, both
the query’s finishing time and the fact that it is accepted
or refused are results that the system is giving back to
the adversary, and we need to consider whether the combination
of all results (not just the query’s numerical answer) is
differentially private. Moreover, we will see, for PINQ, some
ways that a malicious query may cause the actual answer to
not be differentially private.

noisy sum, foreach r in db, of {
  if embarrassing(r)
    then { pause for 1 second };
  return 0
}

Figure 1: Timing attack example


3.2 Timing attacks 


Under the constraints of the above threat model, the eas- 
iest way for a query to send a bit to the adversary is by 
simply pausing for a long time (by entering an infinite 
loop, computing the factorial of a million, etc.) when a cer-
tain condition is detected in the private data, as illustrated
(in PINQ-like pseudocode) in Figure 1. The macroquery
adds together the results of running the microquery on 
each row of the database (always 0) and finally adds 
some random noise to the total. Since almost all of the 
microquery instances finish very quickly, the distribution 
of query execution times observed by the adversary will 
change significantly when an embarrassing record exists 
in the database—a violation of differential privacy. 
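
The effect can be reproduced with a small simulation. It is a sketch under stated assumptions: time is modelled as a counter of simulated milliseconds rather than measured, the one-millisecond baseline cost is invented, the final noise is omitted because it does not affect timing, and all names and data are hypothetical.

```python
def timing_attack_query(db, embarrassing):
    """Return (sum of microquery results, simulated elapsed milliseconds)."""
    elapsed_ms = 0
    total = 0
    for row in db:
        elapsed_ms += 1           # assumed baseline cost of one microquery
        if embarrassing(row):
            elapsed_ms += 1000    # the "pause for 1 second" branch
        total += 0                # every microquery returns 0
    return total, elapsed_ms

def is_target(row):
    return row == "john doe"

db_without = ["alice", "bob", "carol"]
db_with = ["alice", "john doe", "carol"]

result_a, time_a = timing_attack_query(db_without, is_target)
result_b, time_b = timing_attack_query(db_with, is_target)
print(result_a, time_a, result_b, time_b)
```

Both runs produce the same numerical result, yet the simulated completion times differ by a full second, which is the signal the adversary reads.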

A simple “microquery timeout” will not solve this 
problem, for at least two reasons. First, the adversary 
can also signal the condition by causing the query to take 
an unusually small amount of time. The simple way to 
do this is to create an exception condition that aborts the 
entire query. If this is blocked (e.g., by trapping an ex- 
ception in a microquery and replacing it with a default 
result just for that single microquery), the adversary can 
instead make all microqueries take a uniformly longish 
time (say, exactly two milliseconds) except when they 
detect the condition, in which case they terminate im- 
mediately. If the adversary happens to know exactly 
how many records are in the database, this leaks one bit. 
Second, the adversary can defeat a simple “microquery 
timeout” by causing side-effects in the microquery that 
will slow down the macroquery or other microqueries— 
for example, by allocating lots of memory to trigger 
garbage collection in the macroquery. We discuss this 
issue in more detail below. 


3.3 State attacks 


A different class of attacks involves using a channel be- 
tween microqueries, such as a global variable, to break 
differential privacy of the result, as illustrated in Figure 2.

found = false;
noisy sum, foreach r in db, of {
  if (found) then { return 1 }
  if embarrassing(r) then {
    found = true;
    return 1
  } else { return 0 }
}

Figure 2: State attack example


noisy sum, foreach r in db, of {
  if embarrassing(r) then {
    run sub-query that uses
    a lot of privacy budget
  } else {
    return 0
  }
}

Figure 3: Privacy budget attack example


This time, the result of each microquery is either 0 or 1, 
depending on whether any previous microquery detected 
an embarrassing record. Since, in general, the embar- 
rassing record will not be the last one in the database, 
this greatly magnifies the contribution of this one record 
to the result, again violating differential privacy. 
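
The magnification can be seen in a minimal sketch of the same attack. The shared flag is modelled as a closure variable standing in for a global, the noise is omitted to expose the underlying sum, and the database contents are hypothetical.

```python
def state_attack_sum(db, embarrassing):
    """Sum microquery results when microqueries share one mutable flag."""
    found = False

    def microquery(row):
        nonlocal found
        if found:
            return 1              # every row after the hit also reports 1
        if embarrassing(row):
            found = True
            return 1
        return 0

    return sum(microquery(row) for row in db)

def is_target(row):
    return row == "john doe"

db_without = ["row%d" % i for i in range(100)]
db_with = list(db_without)
db_with[10] = "john doe"          # one embarrassing record, at index 10

print(state_attack_sum(db_without, is_target), state_attack_sum(db_with, is_target))
```

A single record moves the un-noised sum from 0 to 90 in this example, far more than noise calibrated to a sensitivity of 1 can hide.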


3.4 Privacy budget attack 


A related form of attack uses the query processor’s deci- 
sion whether to publicize the result of a query as a chan- 
nel for leaking private data, relying on the fact that this 
decision can be influenced by actions of the query that in 
turn depend on private data. This idea can be applied to 
systems that use a dynamic analysis to determine the ‘pri-
vacy cost’ of a query, i.e., the amount that must be sub-
tracted from the privacy budget before the result can be
returned to the querier. As illustrated in Figure 3, the at- 
tack consists of looking for an embarrassing record and, 
when it is found, invoking some sub-query that will use 
up a bit of the remaining privacy budget. Once the outer 
query returns, the adversary simply checks how much the 
privacy budget has decreased. 
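
A minimal sketch of this attack against a dynamically charged budget follows. The `DynamicBudget` class, the costs 0.25 and 0.5, and the data are all hypothetical and chosen only to make the leak visible.

```python
class DynamicBudget:
    """Budget that is charged while the query runs (the vulnerable design)."""

    def __init__(self, epsilon):
        self.remaining = epsilon

    def charge(self, cost):
        if cost > self.remaining:
            raise RuntimeError("privacy budget exceeded")
        self.remaining -= cost

def budget_attack_query(db, embarrassing, budget):
    """Outer query whose numerical result is always 0."""
    budget.charge(0.25)               # assumed cost of the outer query
    for row in db:
        if embarrassing(row):
            budget.charge(0.5)        # sub-query triggered by private data
    return 0

def is_target(row):
    return row == "john doe"

budget_a = DynamicBudget(1.0)
budget_b = DynamicBudget(1.0)
budget_attack_query(["alice", "bob"], is_target, budget_a)
budget_attack_query(["alice", "john doe"], is_target, budget_b)
print(budget_a.remaining, budget_b.remaining)
```

The two runs return the same result but leave different balances, so the remaining budget itself reveals whether the record was present.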


3.5 Case study: PINQ 


We have verified that the current PINQ implementation 
(version 0.1.1, released 08/18/09, available from [23]) is 
vulnerable to all of the above attacks. To demonstrate the 
vulnerabilities, we have written three example programs, 
each based on the test harness that comes with PINQ. 
The original test harness computes several differentially
private statistics on a given text file, including the
number of lines that contain a semicolon. When the
program starts, it first reads the text file and creates a
database whose rows each contain one line of text. Then
it selects all the rows that contain a semicolon, using
microqueries with a boolean predicate p, and finally
performs a noisy count on the resulting set of rows.

                      Constant execution time     Variable execution time
                      Database size public        Database size private
                      $\epsilon$-differential     $(\epsilon, \delta)$-differential
                      privacy                     privacy

Static enforcement    Exact timing analysis       Time bound analysis
                                                  Time noise

Dynamic enforcement   Timeouts                    Timeouts
                      Rounding up                 Time noise

Table 1: Four approaches to the timing-channel problem.

Our attacks are implemented by changing the predi- 
cate p so that it produces some observable side-effects 
when the input file contains a certain string s. For the 
timing attack, we changed p so that, when invoked on a 
line that contains s, p performs an expensive computa- 
tion that takes several seconds and cannot be optimized 
out. For the state attack, we added a static variable that is 
incremented by p when it discovers s, and we write the 
(un-noised) value of this variable to the console at the 
end. For the budget attack, we added a different static 
variable that contains a reference to the database; when s 
is found, p computes a noisy count of the number of rows 
in the database, which decreases the privacy budget. 

The possibility of such attacks is acknowledged in the 
PINQ paper [20], and the PINQ implementation does 
contain hooks for an expression rewriter (called Purify 
in [20]) that is invoked on all user-supplied expressions 
and could potentially change or remove code that causes 
side-effects. However, such a rewriter is not provided; 
indeed, the PINQ downloads page contains an explicit 
warning that the code is not hardened or secured and 
should not be used ‘in the wild.’ 

We conjecture that implementing a reliable Purify 
will be far from trivial. Avoiding the privacy budget at- 
tack will probably be easiest: every function that might 
consume privacy budget could be wrapped with a check 
that raises an exception if it is called from inside a run- 
ning microquery (i.e., with a PINQ operation already on 
the call stack); this exception could then be turned into a 
default result for the microquery. State attacks are more 
difficult: since microqueries in PINQ are arbitrary bits 
of C#, it seems the choices are either to execute them on 
a modified virtual machine that detects writes to global 
state (as Airavat does), or else to create a small domain- 
specific language for writing microqueries that avoids 
global updates by design (as we do in Fuzz). Address- 
ing timing attacks will require deeper changes to PINQ: 
the issues and available solutions are precisely the ones 
we study in this paper.


3.6 Case study: Airavat


Because Airavat calculates sensitivity and deducts the re- 
quired amount from the privacy budget before query ex- 
ecution begins, it is inherently safe from privacy budget 
attacks. However, Airavat’s mechanism for preventing 
state attacks permits a related vulnerability. To prevent 
microqueries from communicating via static variables, 
Airavat runs microqueries on a modified JVM; if a mi- 
croquery ever attempts to modify a static variable, an ex- 
ception is thrown and the whole query is marked “not 
differentially private.” Unfortunately, the adversary can 
now observe whether the system gives them the result at 
the end of query execution or says, “Sorry, that’s not dif- 
ferentially private.” A better alternative would be to abort 
just the microquery, return a default result, and allow the
remainder of the query to run to completion. 

In its published form, Airavat is also vulnerable to tim- 
ing attacks. Its authors acknowledge this weakness [26] 
but counter that the bandwidth of the channel it creates 
is very low. This, we agree, may make it tolerable in 
some contexts, e.g., with “mostly trusted” queriers that 
might be careless but will not write malicious queries 
that intentionally attempt to reveal specific targeted se- 
crets. We understand that Airavat may soon be enhanced 
to add timeouts to microquery executions [Shmatikov, 
personal communication, July 2010]; the implementa- 
tion techniques described below should be useful in this 
effort. 


4 Defending against timing attacks 


State and privacy budget attacks can (and must) be ad- 
dressed by designing the query language so that they are 
impossible. Timing attacks require more work, and this 
will be our concern for the remainder of the paper. 


4.1 Four approaches to the problem 


There are two basic strategies. One is to ensure that a 
given query takes very close to the same amount of time 
for all possible databases (of a given size—see below), 
so that the adversary can learn nothing from observing 
the time it takes the query result to arrive. The other is 
to treat time as an additional output of the query, and to 
limit the amount of information the adversary can gain 
using the same mechanisms (sensitivity analysis and appropriate
perturbation) that are used for data outputs.²
In either approach, we can either obtain the information 
about running time statically (by analyzing the program 
before running it) or enforce limits dynamically (e.g., 
by using timeouts). This gives us the four possibilities 
shown in Table 1. 

The solutions in the right-hand column provide some- 
what weaker privacy guarantees than those on the left. 
In order to properly “noise” a resource like time, we 
must have the ability to both increase and decrease its 
consumption. While we can clearly increase execution 
time by adding a delay, we cannot easily decrease it. We 
can mitigate this problem by adding a default delay $T$;
thus, we can add “time noise” $v > -T$ by delaying for
$T + v$ at the end of each query. Nevertheless, since noise
distributions guaranteeing differential privacy have unbounded
support (i.e., $P(v) > 0$ for all $v$), there is always
a possibility that $v < -T$, in which case we cannot
complete the computation. Thus, $\epsilon$-differential privacy
seems impossible in practice; all we can hope for
is the slightly weaker property of $(\epsilon, \delta)$-differential
privacy [11], where $\delta$ is a bound on the maximum additive
(not multiplicative) difference between the probability of 
any given query output with and without a particular row 
in the input. 
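
As a worked illustration, the first line below states the usual form of $(\epsilon, \delta)$-differential privacy, and the second computes how often the time noise would fall below $-T$. The second line assumes Laplace-distributed noise with scale $\lambda$; both the distribution and the symbol $\lambda$ are assumptions made for this example.

```latex
\[
\Pr[q(b) \in S] \;\le\; e^{\epsilon} \cdot \Pr[q(b') \in S] + \delta
\]
\[
\Pr[v < -T] \;=\; \int_{-\infty}^{-T} \frac{1}{2\lambda}\, e^{-|x|/\lambda}\, dx
\;=\; \frac{1}{2}\, e^{-T/\lambda},
\qquad \text{e.g. } T = 10\lambda \;\Rightarrow\; \Pr[v < -T] = \tfrac{1}{2} e^{-10} \approx 2.3 \times 10^{-5}
\]
```

Under these assumptions the failure case is rare but never impossible for any finite $T$, which is why an additive slack term is needed.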

On the other hand, in the constant-time solutions (left 
column), the size of the database becomes public knowl-
edge, since, except for the most trivial queries, execution
time depends on the size of the database. In practice, 
this is probably a reasonable concession. In the case of 
the variable-time solutions (right column), the size of the 
database does not need to be published. 

The static solutions (top row) are attractive in prin- 
ciple, but they depend on a static analysis of time 
sensitivity—something that has proved challenging ex- 
cept for very simple, inexpressive programming lan- 
guages. We therefore concentrate on the bottom row. 
In this row, we choose one column to explore further: 
the “constant execution time” alternative, where we try 
to make each microquery take as close as possible to 
exactly the same amount of time. (The other column 
also deserves exploration; we believe similar mecha- 
nisms will be required.) 


4.2 Default values 


The approach we explore in the rest of this paper is to 
dynamically ensure that each microquery m takes the ex- 
act same amount of time T. If the microquery takes less 
time to execute, we delay it and only return its result af- 
ter T. If the microquery has executed for time T without 
returning a result, we abort it. However, aborting the enclosing
macroquery is not an option because this would
leak information to an adversarial querier. Instead, our
approach is to have the microquery return a default value
d in this case.

² Note that the sensitivity analysis would have to account for
interdependencies between a query’s execution time and its
output value, which is far from trivial.

To avoid privacy leaks through the default value, d 
must not itself depend on the contents of the database. 
In Fuzz, a static value for d is included with the query. 
Also, for reasons that will become clear in Section 4.4, 
d should fall within the range of the microquery m. 
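
The pad-or-abort rule can be sketched with a step counter standing in for wall-clock time. This is an illustrative model only, not how Fuzz implements predictable transactions: microqueries are written as Python generators that yield once per unit of work, and every name here is hypothetical.

```python
def run_padded(microquery, row, step_budget, default):
    """Run a microquery for exactly step_budget steps of accounted time."""
    steps_used = 0
    result = default
    steps = microquery(row)           # generator: one yield per unit of work
    try:
        while steps_used < step_budget:
            next(steps)
            steps_used += 1
        steps.close()                 # deadline reached: abort, keep default
    except StopIteration as finished:
        result = finished.value       # finished early with a real result
    idle_steps = step_budget - steps_used   # padding up to the deadline
    return result, steps_used + idle_steps

def stall_on_target(row):
    """Adversarial microquery: loops forever on the targeted row."""
    while row == "john doe":
        yield
    yield
    return 0

def line_length(row):
    """Benign microquery: two units of work, then the row length."""
    yield
    yield
    return len(row)

print(run_padded(stall_on_target, "john doe", 5, 0))
print(run_padded(stall_on_target, "alice", 5, 0))
print(run_padded(line_length, "a;b", 5, 0))
```

In this model the accounted time is the same for every row, and the stalled microquery contributes only its default value.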


4.3 Do default values decrease utility? 


When the microquery for a row r times out while an- 
swering a non-adversarial query, the utility of the query’s 
overall result almost inevitably degrades. After all, the 
result no longer incorporates the intended contribution 
of r or any other row whose microquery has timed out, 
but rather uses the default value for each such micro- 
query. However, a non-adversarial querier can always 
avoid the inclusion of any default values by choosing 
a sufficiently high timeout. If the timeouts are chosen 
properly, timeouts should never occur while answering 
non-adversarial queries. Thus, the only querier who ex- 
periences degraded utility is the adversary. 

The question, then, becomes how to choose the time- 
out values. One possible method is as follows. The 
querier is supplied with a reference implementation of 
the query processor that additionally outputs the max-
imum processing time T_max for each microquery. The
querier can then (locally) test his queries on arbitrary
databases of his own construction and thus infer a rea-
sonable time bound. The querier then adds a small safety
margin and uses, say, 1.1 · T_max as the timeout for his
query. He then submits the query to the actual query 
processor, to be run on the private database. 
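
As a minimal sketch of this calibration step, the following Python fragment measures the longest microquery on locally constructed test rows and adds a 10% safety margin. The function names, the test rows, and the use of Python's wall-clock timer are assumptions made for illustration; they are not part of the reference implementation described above.

```python
import time

def max_microquery_time(microquery, rows):
    """Return the longest observed running time (in seconds) over all rows."""
    worst = 0.0
    for row in rows:
        start = time.perf_counter()
        microquery(row)
        worst = max(worst, time.perf_counter() - start)
    return worst

def choose_timeout(t_max, safety_margin=0.1):
    """Add a relative safety margin to the measured maximum (1.1 * t_max by default)."""
    return (1.0 + safety_margin) * t_max

# Hypothetical test database built by the querier, not the private database.
test_rows = [{"income": x} for x in range(1000)]
t_max = max_microquery_time(lambda r: r["income"] > 500, test_rows)
timeout = choose_timeout(t_max)
```

The timeout obtained this way depends only on data the querier already controls, so choosing it reveals nothing about the private database.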


4.4 Do default values create privacy leaks? 


At first glance, it may appear that default values are re- 
placing one evil with another: they seem to plug the tim- 
ing channel at the expense of introducing a data channel. 
However, this is not the case: as long as the timeouts are 
applied at the microquery level (as opposed to imposing 
a timeout on the whole query), differential privacy is pre- 
served, for the following reason. 

First, recall that Fuzz is designed to ensure that the 
completion time of a query depends only on the size of 
the database, but not its contents. Since we have assumed 
that the size of the database is public, and since our threat 
model rules out all the other channels, the only remaining 
way in which private information could ‘leak’ is through 
the (noised) data that the query returns. 

Now, recall that the type system Fuzz implements 
is based on the type system from [25]. As described 
in [25], this type system ensures that all programs that 
type-check are differentially private. This is achieved by 
inferring an upper bound on the program’s sensitivity to
small changes in its inputs—specifically, a change to an 
individual database row. 

Fuzz extends the type system from [25] with micro- 
query timeouts on map and split, but, crucially, time- 
outs do not increase the sensitivity of these two func- 
tions. The reason is that the sensitivity of map and split 
depends on the range of values that the microquery can 
return. Since the default value is taken from the range 
of values that the microquery can already return in the 
absence of timeouts, the addition of timeouts does not 
increase this range, and thus does not increase the sensi- 
tivity either. 

Of course, running a query on a given database with 
and without timeouts (or with shorter vs. longer time- 
outs) can yield very different results. Suppose we have a 
database b and a function with microqueries that, without 
timeouts, produces an output o when it is run on b. If we 
now add a very short microquery timeout, we can easily 
cause all the microqueries to abort and return their de- 
fault value, and the resulting output for the same database 
b can be dramatically different from o. However, this
does not mean that differential privacy is violated. Re- 
call from Section 2.1 that the differential privacy guaran- 
tee makes a statement about running the same query on 
two databases b and b’ that differ in exactly one row r. If 
we run a query with timeouts on both b and b’, the only
microquery that could behave differently is the one on 
row r. All the other microqueries start in the same state 
for both databases, so their behavior will be exactly the 
same—they will either time out on both b and b’, or on 
neither. 
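
The argument can be checked on a toy example. The sketch below is a simulation under stated assumptions: each row carries a value in [0, 1] and a deterministic, made-up running time for its microquery, and a microquery whose running time exceeds the timeout yields the default value. It is not Fuzz code. For every timeout, the outputs on two neighboring databases differ at most in the position of the changed row, and the sum moves by at most the width of the range.

```python
def run_map(db, timeout, default):
    """Apply the microquery to each row; rows that exceed the timeout yield the default."""
    return [value if cost <= timeout else default for value, cost in db]

# Each row is (value in [0, 1], simulated running time of its microquery).
b = [(0.2, 3), (0.9, 8), (0.5, 1), (1.0, 12)]
b_prime = b[:1] + [(0.0, 2)] + b[2:]   # differs from b only in row 1

for timeout in (0, 5, 10, 20):
    out_b = run_map(b, timeout, default=0.5)
    out_bp = run_map(b_prime, timeout, default=0.5)
    differing = [i for i, (x, y) in enumerate(zip(out_b, out_bp)) if x != y]
    # Only the changed row can behave differently.
    assert differing in ([], [1])
    # The default lies in [0, 1], so the sum changes by at most 1.
    assert abs(sum(out_b) - sum(out_bp)) <= 1.0
```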


5 The Fuzz system 


Next, we present the design of the Fuzz system, which 
represents one specific point (the lower left quadrant) in 
the solution space from Table 1. This point is a good 
first step because it works with existing programming- 
language technology and is relatively easy to implement. 


5.1 Overview 


Fuzz consists of three main components: a simple pro- 
gramming language, a type checker, and a predictable 
query processor. The programming language rules out 
channels based on global state or side effects, simply by 
not supporting any primitives that could produce either. 
The type checker rules out budget-based channels by 
statically checking queries before they are executed and 
rejecting any query that cannot be guaranteed to com- 
plete with the available balance. Finally, the predictable 
query processor closes timing-based channels by ensur- 
ing that each microquery terminates after very close to 
exactly a specified amount of time. Figure 4 illustrates 
our approach. 
Figure 4: Scenario. Queries are first type-checked by
Fuzz and then executed in predictable time. 


5.2 Language and type system 


Fuzz queries are written in a simple functional program- 
ming language whose functionality is roughly compara- 
ble to PINQ. The Fuzz language contains a special type 
db for databases, which is not a valid return type of any 
query. We say that a primitive is critical if it takes db 
as an argument. Our language ensures that critical prim- 
itives either return other values of type db (and nothing 
else) or add noise to all of their return values. Fuzz de- 
termines the correct amount of noise to add by using the 
sensitivity analysis and type system from [25]. 

Fuzz currently supports four critical primitives (Ta- 
ble 2): map applies a function f to each row in one 
database and returns the results in another database; 
split applies a boolean predicate p to each row in a 
database and returns two databases, one with all rows r 
for which p(r) = TRUE and the other with the rest; count 
returns the (noised) number of rows in a database; and 
sum returns the (noised) sum of all the rows. sum’s type 
ensures that it can only be applied to databases with nu- 
meric rows. 
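
To make the roles of the four primitives concrete, the following Python sketch mimics their data flow. It is an illustration only: the explicit epsilon parameter, the clipping range for sum, and the Laplace sampler are assumptions of this sketch, whereas Fuzz derives the amount of noise from its type system. Timeouts and padding are omitted here.

```python
import random

def laplace(scale, rng=random):
    """Sample Laplace(0, scale) as the difference of two exponential samples."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def fuzz_map(db, f):
    """Apply f to each row and return the results as a new database."""
    return [f(row) for row in db]

def fuzz_split(db, p):
    """Return the rows for which p holds and the remaining rows."""
    return [r for r in db if p(r)], [r for r in db if not p(r)]

def fuzz_count(db, epsilon, rng=random):
    """Noised number of rows; adding or removing one row changes the count by 1."""
    return len(db) + laplace(1.0 / epsilon, rng)

def fuzz_sum(db, epsilon, lo=0.0, hi=1.0, rng=random):
    """Noised sum of numeric rows, each clipped to the assumed range [lo, hi]."""
    clipped = [min(max(x, lo), hi) for x in db]
    sensitivity = max(abs(lo), abs(hi))   # largest contribution of a single row
    return sum(clipped) + laplace(sensitivity / epsilon, rng)
```

Only fuzz_count and fuzz_sum return values that leave the database type, and both add noise before doing so, which mirrors the rule stated above for critical primitives.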


5.3 Predictable query processor


To close timing channels, the query processor must en- 
sure that all critical primitives take a predictable amount 
of time that depends only on the size of the database. 
This is trivial for sum and count. However, map and 
split involve arbitrary microqueries, and it can be diffi- 
cult to statically analyze how much time these will take. 

To avoid the need for such an analysis, Fuzz instead 
relies on predictable transactions. A predictable trans- 
action is a primitive P-TRANS(A,a,T,d), where A is a 
function, a an argument, T a timeout, and d a default 
value. P-TRANS takes exactly time T, and returns A(a) if
A terminates within time T, or d otherwise. Note that an
implementation of P-TRANS may have to (a) add a delay 
if A terminates early, and (b) abort A slightly before T 
expires to ensure that any resources allocated by A can 
be released in time. In Section 6, we describe two ap- 
proaches to implementing P-TRANS in practice. 

When evaluating map or split, Fuzz invokes P-TRANS
for each microquery, using the specified timeout T and— 
in the case of map—the specified default value (split 
has an implicit default of TRUE). 
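
The observable behavior of a predictable transaction can be modeled in a few lines of Python. The sketch below is a simulation, not an implementation: time is counted in abstract steps rather than measured on a clock, and the microquery is assumed to be a generator that yields once per step and then returns its result. The microquery slow_if_match is hypothetical.

```python
def p_trans(A, a, T, d):
    """Simulated P-TRANS: return (result, elapsed), where elapsed is always T.

    A(a) must be a generator that yields once per unit of simulated time
    and returns its result when it is finished.
    """
    steps = A(a)
    result = d
    for _ in range(T):
        try:
            next(steps)
        except StopIteration as done:
            result = done.value   # A finished within the budget
            break
    steps.close()                 # abort A if it is still running
    return result, T              # the caller always observes exactly T steps

def slow_if_match(row):
    """Hypothetical microquery that works much longer on one particular row."""
    for _ in range(50 if row == "alice" else 2):
        yield
    return len(row)

print(p_trans(slow_if_match, "bob", 10, 0))    # finishes early, padded to T
print(p_trans(slow_if_match, "alice", 10, 0))  # aborted, default returned
```

In both calls the elapsed time reported to the caller is the same, so the running time of A(a) is visible only through the choice between its result and the default.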

Primitive      Arguments                                            Return value
map db f T d   Database db, function f, timeout T, default value d  Database
split db p T   Database db, boolean predicate p, timeout T          Two databases
count db       Database db                                          Noised |db|
sum db         Database db                                          Noised Σ_i db_i

Table 2: Critical primitives in the Fuzz language


All values of type db internally have representations 
of the same size, i.e., they consume the same amount 
of memory and (conceptually) have the same number 
of rows as the original database. If necessary, they are 
padded with dummy rows. For example, if the original 
database has 1,000 rows and consumes 1 MB of mem-
ory, the two databases returned by a split both consume
1 MB, and an invocation of map on either of them will 
invoke 1,000 microqueries—though of course the results 
of microqueries on dummy rows will be discarded. 
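
The padding scheme can be sketched as follows. The DUMMY sentinel and the convention of feeding a placeholder input to microqueries on dummy rows are assumptions of this sketch; the text above does not specify how dummy rows are represented.

```python
DUMMY = object()   # sentinel marking a padding row (hypothetical representation)

def padded_split(db, p):
    """Split db into two databases that both keep len(db) rows."""
    yes = [r if r is not DUMMY and p(r) else DUMMY for r in db]
    no = [r if r is not DUMMY and not p(r) else DUMMY for r in db]
    return yes, no

def padded_map(db, f):
    """Invoke f once per row, dummy rows included, and discard dummy results."""
    out = []
    for r in db:
        result = f(0 if r is DUMMY else r)   # dummy rows still cost a microquery
        out.append(DUMMY if r is DUMMY else result)
    return out

def real_rows(db):
    """Return the rows that carry actual data."""
    return [r for r in db if r is not DUMMY]
```

Because both outputs of padded_split have the same length as the input, the number of microqueries issued by a later padded_map does not reveal how many rows satisfied the predicate.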


5.4 How Fuzz protects privacy 


We now briefly summarize how Fuzz protects against 
covert channels. First, the only observations a querier 
can make that depend on the contents of the database are 
the completion time of the query and its return value. 
This is because of (a) our threat model from Section 3.1, 
(b) the fact that the language contains no primitives with 
side-effects, such as mutating global state, and (c) the 
fact that the type system rules out abnormal termination. 

Second, the return value of the query is differentially 
private. Since db is not a valid return type and critical 
operations return only values of type db or else appropri- 
ately noised values (based on the sensitivity that has been 
statically inferred [25]), the return value cannot depend 
on non-noised values from the database directly. Also, 
the language does not contain any primitives for observ- 
ing side-effects within the query, such as memory con- 
sumption or the current wallclock time. The only time- 
related primitives are the timeouts on the microqueries; 
these have a sensitivity of 1 because (a) each microquery
operates on only one row from the database at a time, and 
(b) microqueries have no access to global state and there- 
fore cannot communicate with one another. Thus, if we 
add or remove one individual’s data from the database, 
this affects only one row, so it can cause at most one more
(or one fewer) microquery to time out and add a default result
to the output. 

Third, the completion time of a query depends only 
on the size of the database (which we assumed to be 
public) and data that has already been noised. To see 
why, consider that the only operations that have access to 
non-noised data are the microqueries, for which Fuzz en- 
forces a constant runtime (by aborting or padding them 
to their timeout), and that values of type db cannot af-
fect the control flow directly, only indirectly through re-
turn values of critical operations, which are noised. It
is perfectly OK for the completion time of a query to de- 
pend on noised data, since such data is safe to release and 
could even have been returned to the querier directly. 

In summary, Fuzz is designed to ensure that everything 
observable by the querier—whether directly through the 
data channel or indirectly through the timing channel— 
either does not depend on the contents of the database or 
has been noised appropriately. 


6 Implementation strategies 


In this section, we describe the abstract requirements for 
implementing predictable transactions, and we propose 
two concrete implementation strategies: one for newly 
designed runtimes (6.2) and one for retrofitting Fuzz into 
an existing runtime (6.3). Naturally, we expect the for- 
mer to be more efficient and the latter to be easier to im- 
plement. 


6.1 Requirements 


To implement P-TRANS(A,a,T,d), the following three
properties need to hold for the language runtime: 


• Isolation: A(a) can be executed without interfering
with the succeeding computation in any way, apart
from contributing its return value.


• Preemptability: The execution of A(a) can be
aborted at any time, or at most within some time
bound Δ_a.


• Bounded deallocation: At any point during the ex-
ecution of A(a), there is an upper bound Δ_d on the
time needed to deallocate all resources allocated so
far by A(a).


If these requirements hold, we can implement P-TRANS
by running A(a) in isolation and setting a timer to T −
Δ_a − Δ_d (which must be updated when Δ_d changes due to
new allocations). If the timer fires, we can abort A and
deallocate its resources without overrunning the overall
timeout T. After a final delay to reach T exactly, we
can return either the result of A(a) if we have it, or d
otherwise.
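
The timing budget implied by these requirements can be written down directly. In the sketch below, delta_a stands for the preemption bound and delta_d for the current deallocation bound; the function name, the dictionary keys, and the example values are illustrative assumptions.

```python
def p_trans_schedule(T, delta_a, delta_d):
    """Return the points in time, relative to the start of the transaction,
    at which a P-TRANS implementation has to act."""
    if delta_a + delta_d >= T:
        raise ValueError("timeout too short to abort and clean up in time")
    return {
        "fire_timer": T - delta_a - delta_d,  # latest moment to start aborting A
        "abort_done_by": T - delta_d,         # A has stopped; deallocation begins
        "return_at": T,                       # result or default is released
    }

# Example with made-up values in microseconds.
schedule = p_trans_schedule(T=100, delta_a=4, delta_d=6)
```

If delta_d grows because A allocates more resources, the timer has to be moved earlier by recomputing the schedule with the new bound.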


6.2 White-box approach 


If we design a new language runtime from scratch, or 
if we are willing to make extensive changes to an exist- 
ing runtime, we can achieve isolation and preemptability 
by avoiding global variables that could be left in an in-
consistent state when a microquery is aborted, as well
as any termination of the microquery that does not cor- 
rectly return the default value. Thus, it becomes possible 
to abort a microquery simply by performing a longjmp 
or its equivalent. 


Regarding bounded deallocation, we expect that the 
key resource in most cases will be memory. It is possi- 
ble to design the memory allocator in such a way that 
the memory allocated by a microquery can be deallo- 
cated in constant time. For example, we can divert the 
allocator from its usual allocation pool while a micro- 
query is in progress, and instead allocate memory from 
a special region dedicated to microqueries. If the micro- 
query takes arguments and returns results by value rather 
than by reference, objects in the main heap cannot ac- 
quire references to this region, so it is safe to summarily 
deallocate the entire region when the microquery aborts 
or terminates. 
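
A region of this kind can be sketched as a bump allocator whose deallocation is a single assignment. The class below is an illustration in Python under simplifying assumptions (a fixed-size byte buffer, offsets instead of pointers); it is not the allocator of any particular runtime.

```python
class Region:
    """Bump allocator over a fixed buffer; freeing everything is one assignment."""

    def __init__(self, size):
        self.buf = bytearray(size)
        self.top = 0

    def alloc(self, n):
        """Return the offset of n fresh bytes, or None if the region is full."""
        if self.top + n > len(self.buf):
            return None          # the caller has to handle an exhausted region
        offset = self.top
        self.top += n
        return offset

    def reset(self):
        """Deallocate everything allocated so far, in constant time."""
        self.top = 0
```

The cost of reset does not depend on how many objects the microquery allocated, which is what gives a constant deallocation bound.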


6.3 Black-box approach 


The first strategy assumes a fairly deep understanding of 
how all primitive operations of the language are imple- 
mented, and how they interact with the allocator and each 
other. If we are working with an existing runtime sys- 
tem, it may be hard to be sure that the entire rest of the 
state of memory outside the microquery allocation region 
has been restored to its original state after a microquery 
finishes; for example, if we use any off-the-shelf library 
functions, they may have local buffers or other global 
state through which information can leak. 


In this case, we can still ensure isolation and preempt- 
ability by leveraging operating system support, e.g., by 
farming out microqueries to a separate process, which 
can then be destroyed at any time without interfering 
with the state of the main runtime. Bounded dealloca- 
tion can be achieved if we know an upper bound on the 
amount of time the operating system needs to destroy a 
process. 
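
The following Python sketch illustrates the process-based variant. It assumes that the microquery is given as a Python expression evaluated in a child interpreter, and that kill_bound is a known upper bound on the time needed to destroy the child; both are assumptions of this sketch. The padding relies on time.sleep, so the precision is far lower than what a real implementation would need.

```python
import subprocess
import sys
import time

def p_trans_process(expr, T, d, kill_bound=0.2):
    """Evaluate `expr` in a child interpreter with a total time budget of T seconds.

    Returns the printed result if the child finishes in time, and d otherwise.
    Either way, the call returns no earlier than T seconds after it started.
    """
    start = time.monotonic()
    result = d
    try:
        done = subprocess.run(
            [sys.executable, "-c", f"print({expr})"],
            capture_output=True, text=True, timeout=T - kill_bound,
        )
        if done.returncode == 0:
            result = done.stdout.strip()
    except subprocess.TimeoutExpired:
        pass                     # run() has already killed the child
    remaining = T - (time.monotonic() - start)
    if remaining > 0:
        time.sleep(remaining)    # pad to the full timeout
    return result
```

Since the child is a separate process, destroying it cannot leave the parent's heap or library state inconsistent, which is the isolation property the black-box approach relies on.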


7 Proof-of-concept implementation 


Next, we describe our proof-of-concept implementation 
of Fuzz. Our implementation does not execute Fuzz pro- 
grams directly; rather, we implemented a front-end that 
accepts Fuzz programs, typechecks them, and then (if 
successful) translates them into Caml programs. Thus, 
we did not need to implement an entire language run- 
time from scratch; it was sufficient to implement a library 
with Fuzz-specific primitives like map and split, and to 
extend an existing runtime with support for predictable 
transactions. We chose Caml because it is similar enough 
to Fuzz to make the translation relatively straightforward. 


7.1 Background: Caml Light


Our implementation is based on Caml Light [5, 19] ver- 
sion 0.75, a stable and lightweight implementation of 
Caml. Here, we briefly describe only the aspects of Caml 
Light that are relevant for our discussion of Fuzz. For a 
detailed description of Caml Light, please see [19]. 

In Caml Light, Caml code is first compiled into byte- 
code for an abstract machine called ZAM (the ZINC ab- 
stract machine); this bytecode is then executed on a run- 
time that implements the ZAM. Because of this archi- 
tecture, the actual ZAM runtime is relatively simple: it 
mainly consists of an interpreter for the ZAM instruc- 
tions and some code for I/O, memory management, and 
garbage collection. 

The state of the ZAM consists of a code pointer, a reg- 
ister holding the current environment, an accumulator, 
two stacks (an argument stack and a return stack) and the 
heap. The heap is divided into two zones: a fixed-size 
‘young’ zone and a variable-size ‘old’ zone. Most ob- 
jects are initially allocated in the young zone; when this 
zone fills up, a ‘minor’ garbage collection copies any ob- 
jects that remain active into the old zone. This was orig- 
inally done to reduce the frequency of ‘major’ garbage 
collection runs (since most objects are short-lived, their 
space can be reclaimed very quickly), but it is also very 
convenient for Fuzz, as we shall see below. 

Note that Fuzz uses the ZAM runtime to run only pro- 
grams that it has previously translated from Fuzz pro- 
grams. Thus, we can safely ignore features of the ZAM 
runtime (such as reference cells) that Fuzz does not use. 
Our threat model assumes that the adversary can submit 
only Fuzz programs, so he or she is unable to access any 
of these features. 


7.2 Bounded deallocation 


When a microquery times out, Fuzz must be able, within 
a bounded amount of time, to release all of the resources 
the microquery may have allocated. To this end, our im- 
plementation performs a minor collection at the begin- 
ning of each macroquery, which clears the young zone 
of the heap, and it confines any additional memory al- 
locations during microqueries to the young zone. Thus, 
we can simply discard the entire young zone after each 
microquery, which requires only a single instruction. If 
the microquery completes normally (without a timeout), 
it writes its result into a special fixed-size buffer that is 
not part of the young zone. If this buffer is empty after the
microquery or contains only a partial result, the macro- 
query uses the default value instead. 

Discarding the entire young zone is safe because, after 
a microquery, there cannot be any outside references to 
objects in that zone. Any new memory allocations must 
be in the young zone, any new values on the stacks are 
discarded as well, and the only objects in the old zone 
that could be modified in place are reference cells, which
translated Fuzz programs cannot use. Note that discard- 
ing the young zone is faster than a minor collection, so 
this particular modification (which is only possible for 
Fuzz programs, not for arbitrary Caml programs) actu- 
ally results in a speedup. 


7.3 Preemptability


Fuzz must be able to preempt a running microquery af- 
ter a specified time, with high precision. To this end, our 
implementation creates a second thread that continuously 
spins on the CPU’s timestamp counter (TSC).³ When a
microquery is started, the interpreter sets a shared vari- 
able to the time at which the preemption should occur; 
when that point is reached, the second thread sends a sig- 
nal to the interpreter thread. To prevent the two threads 
from slowing each other down, each is pinned to a dif- 
ferent CPU core. If the microquery terminates before the 
timeout, it simply spins until the preemption occurs. 

Preemptions can occur at arbitrary points in the run- 
time code. To avoid inconsistencies, our implementation 
checkpoints all mutable state before each microquery; 
when the signal is raised, it uses longjmp to return to the
macroquery and then restores the runtime state from the 
checkpoint. We exclude from the checkpoint any state 
that either is immutable or is discarded anyway — includ- 
ing both zones of the heap and any existing values on the 
stacks. This leaves just a handful of variables, such as 
the ZAM’s stack pointers and the code pointer. 


7.4 Isolation 


Fuzz must ensure that a microquery cannot interfere with 
the rest of the computation in any way, other than con- 
tributing its return value. In the previous two sections, 
we have already seen that the states of the ZAM runtime 
before and after a microquery are logically equivalent, 
since any changes (other than the result value) are either 
discarded or rolled back. To avoid direct timing inter-
ference between microqueries, Fuzz also pads the run-
time of the preemption code to Δ_a + Δ_d. However, Fuzz
must also avoid indirect timing interference through the 
garbage collector, or from the rest of the system. 

Fuzz prevents data-dependent invocations of the 
garbage collector by padding all database rows to con- 
sume the same amount of memory, and by padding all 
database objects to have the same number of rows. For 
databases that result from a split, Fuzz adds an appro- 
priate number of dummy rows that consume memory and 
computation time but do not contribute to the result. Fuzz 
also disables the garbage collector during microqueries; 
if a microquery attempts to allocate more space than is
available in the young zone of the heap, Fuzz stops it and
forces it to time out. Thus, from the perspective of the
macroquery (and the garbage collector), memory usage
does not depend on un-noised values from the database.

³ There are many other ways of implementing preemptions, such as
periodic TSC checks in the interpreter loop, or using the CPU’s perfor-
mance counters.

To prevent page faults and context switches, Fuzz pre- 
allocates and pins all of its memory pages, and it as- 
signs itself a real-time scheduling priority. In our experi- 
ments, this was sufficient to control the timing variations 
to within a few microseconds.


7.5 Implementation effort 


Altogether, we added or modified 6,256 lines of code, in- 
cluding 4,887 lines of C++ for the typechecker/translator, 
1,119 lines of C++ and Caml code for our implemen- 
tation of predictable transactions, 186 lines of C++ for 
benchmarking support, and 64 lines of Fuzz code for 
common library functions. For comparison, the entire 
Caml Light codebase consists of 29,984 lines of code. 
This supports our claim that Fuzz can be retrofitted into 
existing runtimes. 


7.6 Limitations 


Despite all our precautions, some potential sources of 
variability remain. For example, our current implemen- 
tation does not freeze or flush the CPU’s caches (since in- 
structions like wbinvd are not available from user level), 
and it is designed to run on a commodity Linux kernel. 
We believe that these sources would be difficult to exploit 
because the adversary cannot control the memory lay- 
out or force the runtime to invoke system calls; also, any 
exploitable variation would have to be large enough to 
cause the Δ_a + Δ_d padding to be overrun. An implemen-
tation with at least some kernel support could remove
some or all of these sources, and thus use a less conser- 
vative padding. 


8 Evaluation 


Our evaluation has two primary goals. First, we need 
to demonstrate that Fuzz is practical, in the sense that 
it is sufficiently fast and expressive to process realis- 
tic queries. Second, we need to demonstrate that our 
Fuzz implementation is effective, i.e., that it prevents all 
the covert-channel attacks that are possible in our threat 
model (Section 3.1). 


8.1 Non-adversarial queries 


To demonstrate that Fuzz is powerful enough to support 
useful queries, we implemented three example queries 
that were motivated in prior work [4, 6, 12]. The weblog 
query is intended to run on the log of an Apache web 
server; it computes a histogram of the number of web 
requests that came from specific subnets. The kmeans 
query clusters a set of points and returns the three cluster 
centers, and the census query runs on census data and
reports the income differential between men and women.

Name     Type          LoC   Inspired by
kmeans   Clustering
census   Aggregation
weblog   Histogram

Table 3: Examples of non-adversarial Fuzz queries.

Table 3 reports the lines of code needed for each query. 
The queries are small because programmers only need to 
specify the actual data processing; parsing and I/O are 
handled by Fuzz. Also, the queries use a small library of 
generic primitives, such as lists and a fold operator, that 
consists of 64 lines written directly in the Fuzz language. 
Note that Fuzz can automatically certify queries as dif- 
ferentially private and perform sensitivity analysis dur- 
ing typechecking, so even non-experts can easily write 
differentially private queries. 


8.2 Experimental setup 


To evaluate the performance and effectiveness of Fuzz, 
we performed experiments using a setup consistent with 
our model from Section 3.1. We installed Fuzz on a ded-
icated machine, a Dell Optiplex 780 with a 3.06 GHz In-
tel Core 2 Duo E7600 processor and 4 GB of memory.
The machine was running a 32-bit Ubuntu Linux 11.04 
with a 2.6.38-8 kernel. For our timing measurements, 
we used the CPU’s timestamp counter, which is cycle- 
accurate. To minimize interference, we disabled CPU 
power management and the flush daemon, we kept all 
mutable data in a ramdisk and mounted all other file sys- 
tems read-only, and we terminated all other processes on 
the machine, leaving Fuzz as the only running process 
(recall our assumption that the machine is dedicated to 
Fuzz). As discussed in Section 7.6, there are sources of 
timing variability that we could not disable, such as the 
periodic timer interrupt, which takes about 3 µs to han-
dle in this setup, but these cannot be influenced by an
adversary, so they merely add noise to the query comple- 
tion time without leaking information. The padding time, 
which corresponds to Δ_a + Δ_d, was set to 10 µs; this set-
ting was chosen to be the highest preemption latency we
observed, plus a generous safety margin. 

To estimate the overhead of our implementation, we 
also prepared a version of the three translated Fuzz 
queries that can run on the original Caml Light runtime. 
Since the original runtime does not support P-TRANS or 
a fixed-size memory representation for databases, this 
required small modifications to the Caml code; for ex- 
ample, the modified queries invoke microqueries with- 
out any timeouts, and they keep the database in ordinary 
Caml lists. These modifications do not affect the data 
output of the queries. We used the modified Caml code
only for experiments with the original Caml Light run-
time; all other experiments directly use the Caml code
that is output by the Fuzz front-end.

Figure 5: Performance for non-adversarial queries.


8.3 Macrobenchmarks


To estimate the performance of Fuzz, we ran each of the 
example queries from Table 3 over a synthetic dataset 
and measured the query completion time. Using syn- 
thetic data rather than real private data does not affect 
our measurements because, by design, the completion 
time does not depend on the contents of the database. 
However, the data format was based on realistic data— 
specifically, the weblog input was based on an Apache 
server log and the census input was based on U.S. census 
data from [14]. The synthetic database in each case had 
10,000 rows. We set the microquery timeouts for each 
map and split by first running the query over example 
data with timeouts and padding disabled, measuring the 
maximum time taken by any of the map or split’s mi- 
croqueries, and then setting the timeout to be 10% above 
that. We verified that no timeouts occurred during our 
measurements. 

Figure 5 shows the query completion time for three 
different configurations: the original Caml Light run- 
time, the Fuzz runtime with both timeouts and padding 
disabled, and the Fuzz runtime with all features en- 
abled. As expected, Fuzz takes more time to com- 
plete the queries than the original runtime; for our three 
queries, the slowdown was between 2.5x (census) and 
6.8x (kmeans). However, in absolute terms, the com- 
pletion times were not unreasonable: the most expensive 
query (kmeans) took 12.7s to complete, which seems low 
enough to be practical. 

Figure 5 also shows that, with timeouts and padding 
disabled, Fuzz’s performance is roughly comparable to 
that of the original Caml Light runtime. This is not an 
apples-to-apples comparison; for example, the fixed-size 
memory representation for databases costs performance, 
whereas erasing the young zone after each microquery is 
actually faster than garbage-collecting it. Nevertheless,
the numbers suggest that most of the overhead comes
from padding and timeouts. Next, we examine this in
more detail.

Figure 6: Time spent in different phases of query processing.


8.4 Microbenchmarks 


To get a better picture of what factors influence the per- 
formance of our implementation, we added instrumenta- 
tion in such a way that query time can be attributed to 
one of the following five phases: 


• P1: Computation performed by a microquery;


• P2: Waiting for the preemption when a microquery
completes early;


• P3: Preemption handling, storing results, restoring
checkpoints, and loading the next row;


• P4: Padding the time of the preemption handler to
Δ_a + Δ_d; and


• P5: Computation performed by the macroquery.


Figure 6 shows our results (we omit the time P5 taken 
by the macroquery because it was below 0.2% of the to- 
tal for all queries). As already suggested by the previous 
section, the majority of the time is spent in either the 
waiting or the padding phase. This may seem rather con- 
servative at first, but recall that the completion time of 
even a non-adversarial microquery can vary with the row 
it is processing; the timeout needs to be sufficient for the 
longest query with high probability. Timeout handling, 
deallocation, checkpointing, and storing the results take
comparatively little time. 

Note that the overhead for the kmeans query is con- 
siderably higher than for the others. This is because 
kmeans repeatedly uses split to partition the database — 
specifically, to map each point to the nearest of the three 
cluster centers. Since our proof-of-concept implementa- 
tion is not keeping track of the fact that the union of the 
three partitions contains exactly the N rows in the orig- 
inal database, it must conservatively assume that each 
partition might contain all the N rows. Thus, functions
that operate on the partitions are padded to 3 · N times
the timeout, when in fact N times would be sufficient.
This could be avoided by extending Fuzz with a suitable
operator, e.g., a GroupBy as in PINQ.

Figure 7: Variation of completion time for the weblog-delay query.


8.5 Adversarial queries 


As explained in Section 5.4, Fuzz rules out state attacks 
and privacy budget attacks by design, and it prevents tim- 
ing attacks by enforcing that each microquery takes pre- 
cisely the time specified by its timeout. This last point 
cannot be perfectly achieved by a practical implementa- 
tion running on real hardware; we need to quantify how 
close our implementation comes to this goal. 

To this end, we implemented five adversarial queries, 
exploiting different variants of the attacks from Section 3 
to try to vary the completion time based on whether or 
not some specific individual is in the database: 


• weblog-delay adds an artificial delay in each micro-
query that finds a match;


• weblog-term adds an artificial delay except when a
microquery finds a match;


• weblog-mem consumes a lot of memory when a
matching individual is found;


• weblog-gc creates a lot of garbage on the heap by
repeatedly allocating and releasing memory;


• census-delay looks for a particular known person in
the database and adds a timing delay if their income
is above a specified threshold.
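
The effect that a weblog-delay style query aims for, and the way per-microquery padding removes it, can be illustrated with a small simulation. All numbers below (running times in abstract units, the timeout, the padding, and the addresses) are made up for this sketch and are unrelated to the measurements reported in this section.

```python
def microquery_cost(row, target):
    """Simulated running time: the adversarial microquery stalls on a match."""
    return 1000 if row == target else 1

def unprotected_time(db, target):
    """Completion time when every microquery simply runs to completion."""
    return sum(microquery_cost(row, target) for row in db)

def protected_time(db, target, timeout, padding):
    """Completion time when each microquery is aborted or padded to the timeout."""
    total = 0
    for row in db:
        spent = min(microquery_cost(row, target), timeout)  # abort at the timeout
        total += spent + (timeout - spent) + padding        # wait, then pad
    return total

TARGET = "192.0.2.7"
miss = ["198.51.100.%d" % i for i in range(100)]
hit = miss[:-1] + [TARGET]          # same size, one row replaced by the target

gap_unprotected = unprotected_time(hit, TARGET) - unprotected_time(miss, TARGET)
gap_protected = protected_time(hit, TARGET, 10, 2) - protected_time(miss, TARGET, 10, 2)
```

In this model the unprotected completion times differ by the full artificial delay, while the protected completion times are identical because they depend only on the number of rows.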


We ran each query on two versions of the corresponding 
database: one that contains the individual (Hit) and an- 
other that does not (Miss). To demonstrate the effective- 
ness of these attacks on an unprotected system, we first 
performed the experiment with the Fuzz runtime and then
repeated it with the original Caml Light runtime. This 
gives us four configurations per query. We ran 100 trials
for each configuration, after a warm-up phase of two tri-
als to ensure that the Fuzz binary and the database were
in the file system caches.

Query          Attack type         Caml Light runtime (not protected)   Fuzz runtime (protected)
                                   Hit         Miss
weblog-mem     Memory allocation   1.961 s     0.317 s
weblog-gc      Garbage creation    1.567 s     0.318 s
weblog-delay   Artificial delay    1.621 s     0.318 s
weblog-term    Early termination   26.378 s    26.384 s
census-delay   Artificial delay    2.168 s     0.897 s

Table 4: Effect of various attacks without and with predictable transactions. Each adversarial query tries to vary
its completion time based on whether some specific individual is in the database. We show the total macroquery
processing times when the individual is present (hit) and absent (miss), as well as the differences.


Figure 7 shows how the completion times varied 
across the 100 trials, using the weblog-delay query with 
the Miss database as an example. With the original 
runtime, the completion times varied by approximately 
±150 µs around the median. With the Fuzz runtime,
the completion times are extremely stable: the difference
between maximum and minimum was <1 µs. The results
for the other queries were similar, indicating that
Fuzz’s padding mechanism successfully masks internal 
variations between trials. Hence, we only report median 
values here. 





Table 4 shows our results for the different configura- 
tions. We make the following three observations. First, 
the attacks are very effective when protections are dis- 
abled. For four out of the five queries, the completion 
times for the Hit cases were at least one second different 
from the completion times for the Miss cases, so an ad- 
versarial querier could easily have distinguished between 
the two cases and thus learned with certainty whether or 
not the individual was in the database. We could have 
achieved even higher differences simply by changing the 
queries. For weblog-term, the difference was only a 
few milliseconds; the reason is that, in order to change 
the completion time of the query by one second through 
early termination, the adversary would have had to make 
each microquery take at least one second, so the overall 
query would have taken a conspicuously long time — in 
this case, nearly three hours. 
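
The first observation can be checked with simple arithmetic on the unprotected timings that appear in Table 4. The sketch below assumes that the first figure listed for each query is the Hit time and the second is the Miss time; that reading is an assumption, chosen because it matches the description above.

```python
# Median completion times (seconds) on the unprotected runtime, as they appear
# in Table 4. Reading them as (Hit, Miss) pairs is an assumption.
times = {
    "weblog-mem":   (1.961, 0.317),
    "weblog-gc":    (1.567, 0.318),
    "weblog-delay": (1.621, 0.318),
    "weblog-term":  (26.378, 26.384),
    "census-delay": (2.168, 0.897),
}

diffs = {name: abs(hit - miss) for name, (hit, miss) in times.items()}
distinguishable = [name for name, d in diffs.items() if d >= 1.0]

for name, d in diffs.items():
    print(f"{name:13s} |Hit - Miss| = {d:.3f} s")
print(len(distinguishable), "of", len(times), "queries differ by at least 1 s")
```

Under this reading, four of the five queries differ by more than one second, and weblog-term differs by about 6 ms.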

Second, the attacks cease to be effective in Fuzz. In 
each case, the difference between Hit and Miss is so 
small we could not even reliably measure it locally on 
the machine (for comparison, handling a timer interrupt 
requires about 3 µs, and one hundred of these are triggered
every second, limiting the achievable accuracy),
much less across a wide-area network, using the small 
number of trials that the privacy budget allows. 

Third, the completion times are higher when protec- 
tions are enabled. This is consistent with our earlier ob- 
servations from Section 8.3. 


8.6 Summary 


Our results show that Fuzz is effective: it eliminates state 
and budget channels by design, and narrows the timing 
channel to a point where it ceases to be useful to an ad- 
versary. Query completion times remain practical but are 
substantially higher than in an unprotected system. 


9 Related Work 


Differential privacy: There is a considerable body of 
work on the theory of differential privacy [8–10] and 
on differentially private data analysis [20,26]. Except 
for the papers on Airavat [26] and PINQ [20], none of 
these papers discuss covert-channel attacks by adversar- 
ial queriers. The PINQ paper briefly mentions certain 
security issues, such as exceptions and non-termination; 
Airavat discusses timing channels, but, as we have shown 
in Section 3.5, its defense is not fully effective. The 
present paper complements existing work by providing 
a practical defense against covert-channel attacks, which 
could be applied to existing systems. 
Covert channels: Covert channels have plagued sys- 
tems for decades [18,30], and they are notoriously hard 
to avoid in general. Fuzz is a domain-specific solution; 
it only addresses differentially private query processing, 
but it can give strong assurances in this specific setting. 
A variety of defenses against covert channels have 
been suggested. Most related to this paper is the work 
on external timing channels. The bandwidth of external 
timing channels can be reduced, e.g., by adding random 
delays [15, 16] or by time quantization [2]. However, to 
guarantee differential privacy, the adversary must be pre- 
vented from learning even a single bit of private infor- 
mation with certainty, so a mere reduction in bandwidth 
is not sufficient in our setting. Fuzz avoids this problem 
by converting the timing channel into a storage channel, 
which in turn is handled by differential privacy. 
Preventing timing channels seems hopeless in the gen- 
eral case. Language-based designs can eliminate them 
for certain types of programs [1], but only at the expense 
of severely limiting the expressiveness of the program- 
ming language. Shroff and Smith [27] show how to han- 
dle more general computations but may have to abort 
them, which can result in garbled data and/or leak information
through a storage channel. In the context of a 
differentially private query, however, aborting individual 
microqueries is safe because the impact on the overall 
result is known to be bounded by the sensitivity of the 
query. As shown in Section 4.4, returning default val- 
ues does not open a new storage channel or increase the 
privacy cost of the query (though it may decrease its use- 
fulness). 

Side channels: Side channels can leak private informa- 
tion, e.g., through electromagnetic radiation [13,24] or 
power consumption [17]. Many of these channels can 
only be exploited if the adversary is physically close to 
the machine that executes the queries, which is not per- 
mitted by our threat model. 

Real-time systems: Some real-time systems have pro- 
visions for handling timer overrun problems in untrusted 
code, such as preemption or partial admission [29]. In 
our scenario, it would not be sufficient to simply preempt 
a microquery that has overshot its timeout—we must be 
able to terminate it and clean up all of its side effects be- 
fore the timeout expires. Another approach is inferring 
the worst-case execution time [28], which is known to be 
difficult even for trusted code. 


10 Conclusion 


We have demonstrated that state-of-the-art systems for 
differentially private data analysis are vulnerable to sev- 
eral different kinds of covert-channel attacks from adver- 
sarial queriers. Covert channels are particularly danger- 
ous in this context because the leakage of even a single 
bit of private, un-noised information completely destroys 
the guarantees these systems are designed to provide. We 
analyzed the space of potential solutions, and we pre- 
sented the design of Fuzz, which represents one specific 
solution from this space and relies on default values and 
predictable transactions. Using a proof-of-concept im- 
plementation based on Caml Light, we demonstrated that 
Fuzz can be retrofitted into an existing language runtime. 
Our evaluation shows that Fuzz is practical and expres- 
sive enough to support realistic queries. Fuzz increases 
query completion times compared to systems without 
covert-channel defenses, but the increase does not seem 
large enough to prevent practical applications. 


Outsourcing the Decryption of ABE Ciphertexts 


Abstract 


Attribute-based encryption (ABE) is a new vision for 
public key encryption that allows users to encrypt and 
decrypt messages based on user attributes. For example, 
a user can create a ciphertext that can be decrypted only 
by other users with attributes satisfying (“Faculty” OR 
(“PhD Student” AND “Quals Completed”)). Given its 
expressiveness, ABE is currently being considered for 
many cloud storage and computing applications. How- 
ever, one of the main efficiency drawbacks of ABE is that 
the size of the ciphertext and the time required to decrypt 
it grows with the complexity of the access formula. 

In this work, we propose a new paradigm for ABE that 
largely eliminates this overhead for users. Suppose that 
ABE ciphertexts are stored in the cloud. We show how 
a user can provide the cloud with a single transformation 
key that allows the cloud to translate any ABE ciphertext 
satisfied by that user’s attributes into a (constant-size) El 
Gamal-style ciphertext, without the cloud being able to 
read any part of the user’s messages. 

To precisely define and demonstrate the advantages of 
this approach, we provide new security definitions for 
both CPA and replayable CCA security with outsourc- 
ing, several new constructions, an implementation of our 
algorithms and detailed performance measurements. In a 
typical configuration, the user saves significantly on both 
bandwidth and decryption time, without increasing the 
number of transmissions. 


1 Introduction 


Traditionally, we have viewed encryption as a method 
for one user to encrypt data to another specific targeted 
party, such that only the target recipient can decrypt and 
read the message. However, in many applications a user 
might often wish to encrypt data according to some policy
as opposed to a specified set of users. Trying to realize 
such applications on top of a traditional public key mech- 
anism poses a number of difficulties. For instance, a user 
encrypting data will need to have a mechanism which 
allows him to look up all parties that have access creden- 
tials or attributes that match his policy. These difficul- 
ties are compounded if a party’s credentials themselves 
might be sensitive (e.g., the set of users with a TOP SE- 
CRET clearance) or if a party gains credentials well after 
data is encrypted and stored. 

To address these issues, a new vision of encryption 
was put forth by Sahai and Waters [38] called Attribute- 
Based Encryption (ABE). In an ABE system, a user will 
associate an encryption of a message M with a function 
f(·), representing an access policy associated with the 
decryption. A user with a secret key that represents their 
set of attributes (e.g., credentials) S will be able to 
decrypt a ciphertext associated with function f(·) if and 
only if f(S) = 1. Since the introduction of ABE there 
have been several other works proposing different variants
[24, 7, 14, 36, 23, 42, 15, 28, 35], both extending 
functionality and refining security proof techniques.¹
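
To make the condition f(S) = 1 concrete, the sketch below evaluates the example policy from the abstract on a few attribute sets. It only evaluates the policy in the clear and involves no cryptography; the nested-tuple representation of a policy and the function name `satisfies` are hypothetical choices for illustration.

```python
# A policy is either an attribute name (str) or a tuple (op, left, right)
# with op in {"AND", "OR"}. This representation is an illustrative assumption.

def satisfies(policy, attrs):
    """Return True iff the attribute set `attrs` satisfies `policy`, i.e. f(S) = 1."""
    if isinstance(policy, str):
        return policy in attrs
    op, left, right = policy
    if op == "AND":
        return satisfies(left, attrs) and satisfies(right, attrs)
    if op == "OR":
        return satisfies(left, attrs) or satisfies(right, attrs)
    raise ValueError(f"unknown operator: {op}")


policy = ("OR", "Faculty", ("AND", "PhD Student", "Quals Completed"))

print(satisfies(policy, {"Faculty"}))                         # True
print(satisfies(policy, {"PhD Student", "Quals Completed"}))  # True
print(satisfies(policy, {"PhD Student"}))                     # False
```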

One property that all of these ABE systems have is 
that both the ciphertext size and time for decryption grow 
with the size of the access formula f. Roughly, cur- 
rent efficient ABE realizations are set in pairing-based 
groups where the ciphertexts require two group elements 
for every node in the formula and decryption will require 


¹ A more general concept of functional encryption [11] allows for 
more general functions to be computed on the encrypted data and en- 
compasses work such as searching on encrypted data and predicate en- 
cryption [10, 2, 12, 39, 27]. 
Scheme        ABE    Security   Model   Full CT                       Full Decrypt                             Out CT           Out Dec
              Type   Level              Size                          Ops                                      Size             Ops
Waters [42]   CP     CPA        -       $|G_T| + (1+2\ell)|G|$        $\le (2+\ell)P + 2\ell E_G$              -                -
§3.1          CP     CPA        -       $|G_T| + (1+2\ell)|G|$        $\le (2+\ell)P + 2\ell E_G$              $2|G_T|$         $E_T$
§3.2          CP     RCCA       RO      $|G_T| + (1+2\ell)|G| + k$    $\le (2+\ell)P + 2\ell E_G + 2E_T$       $2|G_T| + k$     $3E_T$
GPSW [24]     KP     CPA        -       $|G_T| + (1+s)|G|$            $\le (1+\ell)P + 2\ell E_G$              -                -
§4.1          KP     CPA        -       $|G_T| + (1+s)|G|$            $\le (1+\ell)P + 2\ell E_G$              $2|G_T|$         $E_T$
§4.2          KP     RCCA       RO      $|G_T| + (1+s)|G| + k$        $\le (1+\ell)P + 2\ell E_G + 2E_T$       $2|G_T| + k$     $3E_T$


Figure 1: Summary of ABE outsourcing results. Above, $s$ denotes the size of an attribute set, $\ell$ refers to an LSSS access 
structure with an $\ell \times n$ matrix, $k$ is the message bit length in RCCA schemes, and $P$, $E_G$, $E_T$ stand for the maximum time 
to compute a pairing, an exponentiation in $G$, and an exponentiation in $G_T$, respectively. We ignore non-dominant operations. 
All schemes are in the selective security setting. We discuss methods for moving to adaptive security in Section 5.1. 


a pairing for each node in the satisfied formula. While 
conventional desktop computers should be able to handle 
such a task for typical formula sizes, this presents a sig- 
nificant challenge for users that manage and view private 
data on mobile devices where processors are often one to 
two orders of magnitude slower than their desktop coun- 
terparts and battery life is a persistent problem. Interest- 
ingly, in tandem there has emerged the ability for users 
to buy on-demand computing from cloud-based services 
such as Amazon’s EC2 and Microsoft’s Windows Azure. 

Can cloud services be securely used to outsource de- 
cryption in Attribute-Based Encryption systems? A 
naive first approach would be for a user to simply hand 
over their secret key, SK, to the outsourcing service. 
The service could then simply decrypt all ciphertexts re- 
quested by the user and then transmit the decrypted data. 
However, this requires complete trust of the outsourc- 
ing service; using the secret key the outsourcing service 
could read any encrypted message intended for the user. 

A second approach might be to leverage recent out- 
sourcing techniques [20, 17] based on Gentry’s [21] fully 
homomorphic encryption system. These give outsourc- 
ing for general computations and importantly preserve 
the privacy of the inputs so that the decryption keys and 
messages can remain hidden. Unfortunately, the over- 
head for these systems is currently impractical. Gentry 
and Halevi [22] showed that even for weak security pa- 
rameters one “bootstrapping” operation of the homomor- 
phic operation would take at least 30 seconds on a high 
performance machine (and 30 minutes for the high se- 
curity parameter). Since one such operation would only 
count for a small constant number of gates in the overall 
computation, this would need to be repeated many times 
to evaluate an ABE decryption using the methods above. 

Closer to practice, we might leverage recent tech- 
niques on secure outsourcing of pairings [16]. These 
techniques allow a client to outsource a pairing operation 
to a server. However, the solutions presented in [16] still 
require the client to compute multiple exponentiations in 
the target group for every pairing it outsources. These exponentiations
can be quite expensive and the work of the 
client will still be proportional to the size of the policy 
f. Moreover, every pairing operation in the original protocol
will trigger four pairings to be done by the proxy. 
Thus, the total workload is increased by a factor of at 
least four from the original decryption algorithm, and the 
client’s bandwidth requirements may actually increase. 
Given these drawbacks, we aim for an ABE outsourcing 
system that is secure and imposes minimal overhead. 


Our Contributions. We give new methods for efficiently
and securely outsourcing decryption of ABE ciphertexts.
The core change to outsourceable ABE systems
is a modified Key Generation algorithm that produces
two keys. The first key is a short El Gamal [19] 
type secret key that must be kept private by the user. The 
second is what we call a “transformation key”, TK, that 
is shared with a proxy (and can be publicly distributed). 
If the proxy then receives a ciphertext CT for a function
f that the user’s credentials satisfy, it is then 
able to use the key TK to transform CT into a simple and 
short El Gamal ciphertext CT’ of the same message encrypted
under the user’s key SK. The user is then able to 
decrypt with one simple exponentiation. Our system is 
secure against any malicious proxy. Moreover, the com- 
putational effort of the proxy is no more than that used to 
decrypt a ciphertext in a standard ABE system. 

To achieve our results, we create what we call a new 
key blinding technique. At a high level, the new out- 
sourced key generation algorithm will first run a key gen- 
eration algorithm from an existing bilinear map based 
ABE scheme such as [24, 42]. Then it will choose a 
blinding factor exponent $z \in \mathbb{Z}_p$ (for groups of prime order
$p$) and raise all elements to $z^{-1} \pmod{p}$. This will 
produce the transformation key TK, while the blinding 
factor z can serve as the secret key. 
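
The algebra behind the blinding step can be checked in a toy setting. The sketch below is not the construction of this paper: it uses no pairings and no access policy, the parameters are far too small to be secure, and the ABE key is modeled as a single exponent k. It only shows that a value raised to $z^{-1}$ by one party can be unblinded with a single exponentiation by $z$.

```python
import secrets

# Toy parameters (assumptions): P = 2p + 1 is a safe prime and h generates
# the subgroup of order p in Z_P^*. These sizes offer no security.
p = 1019
P = 2 * p + 1          # 2039
h = 4                  # a square mod P, hence an element of order p

# Stand-in for the value that masks the message in an ABE ciphertext;
# here it is simply h^k for a random exponent k playing the role of the key.
k = secrets.randbelow(p - 1) + 1
mask = pow(h, k, P)

message = pow(h, 123, P)            # message encoded as a subgroup element
ciphertext = (message * mask) % P   # C = M * mask

# Key generation with blinding: the user keeps z, the proxy gets the key
# raised to 1/z (in this toy model: the exponent k / z mod p).
z = secrets.randbelow(p - 1) + 1    # z in [1, p-1], so z is invertible mod p
z_inv = pow(z, -1, p)               # modular inverse (Python 3.8+)
tk = (k * z_inv) % p

# Transform (proxy): produces mask^(1/z) from the blinded key material.
partial = pow(h, tk, P)

# Decryption (user): one exponentiation by z, then one division.
recovered_mask = pow(partial, z, P)
recovered = (ciphertext * pow(recovered_mask, -1, P)) % P

print(recovered == message)   # True
```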

We show that we are able to adapt our outsourcing 
techniques to both the “Ciphertext-Policy” (CP-ABE) 
and “Key-Policy” (KP-ABE) types of ABE systems.² To 


² CP-ABE systems behave as we outlined above where a ciphertext 
Figure 3: Outsourcing the Decryption: Illustration of 
how ABE ciphertexts could be transformed by a proxy 
into much shorter El Gamal-style ciphertexts. 


achieve our KP-ABE and CP-ABE outsourcing systems 
we respectively apply our methodology to the construc- 
tions of Goyal et al. [24] and Waters [42]. To prove se- 
curity of the systems we must show that they remain se- 
cure even in the presence of an attacker that acts as a 
user’s proxy. Our first systems and proofs model seman- 
tic security for an attacker that tries to eavesdrop on the 
user. We then extend our systems and proofs to chosen 
ciphertext attacks where the attacker might query the user’s 
decryption routine on maliciously formed ciphertexts to 
compromise privacy. Our solutions in this setting apply 
the random oracle heuristic to achieve efficiency near the 
chosen plaintext versions. 


Typical Usage Scenarios. We envision a typical usage 
scenario in Figures 2 and 3. Here a client sends a single 
transformation key once to the proxy, who can then re- 
trieve (potentially large) ABE ciphertexts that the user is 
interested in and forward to her (small, constant-size) El 
Gamal-type ciphertexts. The proxy could be the client’s 
mail server, or the ciphertext server and the proxy could 
be the same entity, as in a cloud environment. 

The savings in bandwidth and local computation time 
for the client are immediate: a transformed ciphertext 
is always smaller and faster to decrypt than an ABE ci- 
phertext of [24, 42] (for any policy size). We emphasize 
in this usage scenario that the number of transmissions 
will be the same as in the prior (non-outsourced) solu- 
tions. Thus, the power consumption can only improve 
with faster computations and smaller transmissions. 


Implementation and Evaluation. To evaluate our out- 
sourcing systems, we implemented the CP-ABE version 


is associated with a boolean access formula f and a user’s key is a set of 
attributes x, where a user can decrypt if f(x) = 1. KP-ABE is useful in 
applications where we want to have the mirror image semantics where 
the attributes x are associated with a ciphertext and an access formula 
f with the key. 
and tested it in an outsourcing environment. Our implementation
modified part of the libfenc [25] library, which 
includes a current CP-ABE implementation. We con- 
ducted our experiments on both an ARM-based mobile 
device and an Intel server to model the user device and 
proxy respectively. 

Outsourcing decryption resulted in significant practical
benefits. When decrypting an ABE ciphertext containing
100 attributes, we found that without the use of a 
proxy the mobile device would require about 30 seconds 
of computation time and drain a significant amount of 
the device’s battery. When we applied our outsourcing 
technique, decrypting the ciphertext took 2 seconds on 
our Intel server and approximately 60 milliseconds on 
the mobile device itself. 

To demonstrate compatibility with existing infrastruc- 
ture, we constructed a re-usable platform for outsourcing 
decryption using the Amazon EC2 service. Our proxy is 
deployed as a public Amazon Machine Image that can be 
programmatically instantiated by any application requir- 
ing acceleration. 

In addition to the core benefits of outsourcing, we dis- 
covered other collateral advantages. In existing ABE im- 
plementations [6, 25] much of the decryption code is 
dedicated to determining how a policy is satisfied by a 
key and executing the corresponding pairing computa- 
tions of decryption. In our outsourcing solution, most 
of this code is pushed into the untrusted transformation 
algorithm, leaving only a much smaller portion on the 
user’s device. This has two advantages. First, the amount 
of decryption code that needs to reside on a resource con- 
strained user device will be smaller. Actually, all bilinear 
map operations can be pushed outside. Second, this par- 
titioning will dramatically decrease the size of the trusted 
code base, removing thousands of lines of complex pars- 
ing code. Even without using outsourcing, this partition- 
ing of code is useful. 


Related Work: Proxy Re-Encryption. In this work, we 
show how to delegate (in a true offline sense) the ability 
to transform an ABE ciphertext on message m into an 
El Gamal-style ciphertext on the same m, without learn- 
ing anything about m. This is similar to the concept of 
proxy re-encryption [8, 4] where an untrusted proxy is 
given a re-encryption key that allows it to transform an 
encryption under Alice’s key of m into an encryption un- 
der Bob’s key of the same m, without allowing the proxy 
to learn anything about m. 


2 Background 


We first give the security definitions for ABE with outsourcing.
We then give background information on bilinear
maps. Finally, we provide formal definitions for 
access structures and relevant background on Linear Secret
Sharing Schemes (LSSS), as taken from [42]. 


Types of ABE. We consider two distinct varieties 
of Attribute-Based Encryption: Ciphertext-Policy (CP- 
ABE) and Key-Policy (KP-ABE). In CP-ABE an access 
structure (policy) is embedded into the ciphertext during 
encryption, and each decryption key is based on some 
attribute set S. KP-ABE inverts this relationship, embedding
S into the ciphertext and a policy into the key.³ We 
capture both paradigms in a generalized ABE definition. 


2.1 Access Structures 


Definition 1 (Access Structure [5]) Let $\{P_1, P_2, \ldots, P_n\}$ 
be a set of parties. A collection $A \subseteq 2^{\{P_1, P_2, \ldots, P_n\}}$ is monotone
if $\forall B, C$: if $B \in A$ and $B \subseteq C$ then $C \in A$. An access 
structure (respectively, monotone access structure) is a 
collection (resp., monotone collection) $A$ of non-empty 
subsets of $\{P_1, P_2, \ldots, P_n\}$, i.e., $A \subseteq 2^{\{P_1, P_2, \ldots, P_n\}} \setminus \{\emptyset\}$. 
The sets in $A$ are called the authorized sets, and the sets 
not in $A$ are called the unauthorized sets. 


In our context, the role of the parties is taken by the 
attributes. Thus, the access structure A will contain the 
authorized sets of attributes. We restrict our attention to 
monotone access structures. However, it is also possible 
to (inefficiently) realize general access structures using 
our techniques by defining the “not” of an attribute as 
a separate attribute altogether. Thus, the number of at- 
tributes in the system will be doubled. From now on, 
unless stated otherwise, by an access structure we mean 
a monotone access structure. 
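
For a small set of parties, monotonicity can be checked by brute force directly from Definition 1. The sketch below is illustrative only: the helper names and the three-party example are assumptions, and enumerating the power set is feasible only for very small universes.

```python
from itertools import combinations


def powerset(parties):
    """All subsets of `parties`, as frozensets."""
    items = list(parties)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


def is_monotone(collection, parties):
    """True iff every superset (within `parties`) of an authorized set
    is itself authorized."""
    authorized = {frozenset(s) for s in collection}
    return all(c in authorized
               for b in authorized
               for c in powerset(parties)
               if b <= c)


parties = {"P1", "P2", "P3"}

# Authorized sets for the policy "P1 AND P2": {P1,P2} and {P1,P2,P3}.
a_monotone = [{"P1", "P2"}, {"P1", "P2", "P3"}]
# Dropping the superset {P1,P2,P3} breaks monotonicity.
a_broken = [{"P1", "P2"}]

print(is_monotone(a_monotone, parties))  # True
print(is_monotone(a_broken, parties))    # False
```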


2.2 ABE with Outsourcing 


Let $S$ represent a set of attributes, and $A$ an access structure.
For generality, we will define $(I_{enc}, I_{key})$ as the inputs
to the encryption and key generation function respectively.
In a CP-ABE scheme $(I_{enc}, I_{key}) = (A, S)$, 
while in a KP-ABE scheme we will have $(I_{enc}, I_{key}) = (S, A)$.
A CP-ABE (resp. KP-ABE) scheme with outsourcing
functionality consists of five algorithms: 


Setup($\lambda$, $U$). The setup algorithm takes a security parameter
$\lambda$ and an attribute universe description $U$ as input. It outputs 
the public parameters PK and a master key MK. 


Encrypt(PK, M, $I_{enc}$). The encryption algorithm takes 
as input the public parameters PK, a message M, and an 


³ More intuitively, CP-ABE is often suggested as a means to implement
role-based access control, where the user’s key attributes correspond
to long-term roles and ciphertexts carry an access policy.
Key-Policy ABE is more appropriate in applications where ciphertexts may 
be tagged with attributes (e.g., relating to message content), and each 
user’s access to these ciphertexts is determined by a policy in their
decryption key. For more on applications, see e.g., [37]. 
access structure (resp. attribute set) $I_{enc}$. It outputs the 
ciphertext CT. 


$\mathrm{KeyGen}_{out}$(MK, $I_{key}$). The key generation algorithm 
takes as input the master key MK and an attribute set 
(resp. access structure) $I_{key}$ and outputs a private key SK 
and a transformation key TK. 


Transform(TK, CT). The ciphertext transformation algorithm
takes as input a transformation key TK for $I_{key}$ 
and a ciphertext CT that was encrypted under $I_{enc}$. It outputs
the partially decrypted ciphertext CT’ if $S \in A$ and 
the error symbol $\perp$ otherwise. 


$\mathrm{Decrypt}_{out}$(SK, CT’). The decryption algorithm takes as 
input a private key SK for $I_{key}$ and a partially decrypted 
ciphertext CT’ that was originally encrypted under $I_{enc}$. 
It outputs the message M if $S \in A$ and the error symbol 
$\perp$ otherwise.⁴


Why RCCA security? We describe a security model for 
ABE that supports outsourcing. We want a very strong 
notion of security. The traditional notion of security 
against adaptive chosen-ciphertext attacks (CCA) is a bit 
too strong since it does not allow any bit of the cipher- 
text to be altered, and the purpose of our outsourcing is 
to compress the size of the ciphertext. We thus adopt 
a relaxation due to Canetti, Krawczyk and Nielsen [13] 
called replayable CCA security, which allows modifica- 
tions to the ciphertext provided they cannot change the 
underlying message in a meaningful way. 


RCCA Security Model for ABE with Outsourcing. Fig- 
ure 4 describes a generalized RCCA security game for 
both KP-ABE and CP-ABE schemes with outsourcing. 
We define the advantage of an adversary $\mathcal{A}$ in this game 
as $\Pr[b' = b] - \frac{1}{2}$. 


Definition 2 (RCCA-Secure ABE with Outsourcing) 
A CP-ABE or KP-ABE scheme with outsourcing is 
RCCA-secure (or secure against replayable chosen- 
ciphertext attacks) if all polynomial time adversaries 
have at most a negligible advantage in the RCCA game 
defined above. 


CPA Security. We say that a system is CPA-secure (or 
secure against chosen-plaintext attacks) if we remove the 
Decrypt oracle in both Phase 1 and 2. 


Selective Security. We say that a CP-ABE (resp. KP- 
ABE) system is selectively secure if we add an Init stage 
before Setup where the adversary commits to the challenge
value $I^*_{enc}$. 


⁴ Note that we can implement the standard (non-outsourced) ABE 
Decrypt algorithm by combining Transform and $\mathrm{Decrypt}_{out}$. 
Setup. The challenger runs the Setup algorithm and gives the public parameters, PK, to the adversary. 


Phase 1. The challenger initializes an empty table $T$, an empty set $D$ and an integer $j = 0$. Proceeding adaptively, 
the adversary can repeatedly make any of the following queries: 


• Create($I_{key}$): The challenger sets $j := j + 1$. It runs the outsourced key generation algorithm on $I_{key}$ to obtain the 
pair (SK, TK) and stores in table $T$ the entry $(j, I_{key}, SK, TK)$. It then returns to the adversary the transformation 
key TK. 

Note: Create can be repeatedly queried with the same input. 


• Corrupt($i$): If there exists an $i$th entry in table $T$, then the challenger obtains the entry $(i, I_{key}, SK, TK)$ and sets 
$D := D \cup \{I_{key}\}$. It then returns to the adversary the private key SK. If no such entry exists, then it returns $\perp$. 


• Decrypt($i$, CT): If there exists an $i$th entry in table $T$, then the challenger obtains the entry $(i, I_{key}, SK, TK)$ and 
returns to the adversary the output of the decryption algorithm on input (SK, CT). If no such entry exists, then 
it returns $\perp$. 


Challenge. The adversary submits two equal-length messages $M_0$ and $M_1$. In addition the adversary gives a value 
$I^*_{enc}$ such that for all $I_{key} \in D$, $f(I_{key}, I^*_{enc}) \neq 1$. The challenger flips a random coin $b$, and encrypts $M_b$ under $I^*_{enc}$. The 
resulting ciphertext CT is given to the adversary. 


Phase 2. Phase 1 is repeated with the restrictions that the adversary cannot 


• trivially obtain a private key for the challenge ciphertext. That is, it cannot issue a Corrupt query that would 
result in a value $I_{key}$ which satisfies $f(I_{key}, I^*_{enc}) = 1$ being added to $D$. 

• issue a trivial decryption query. That is, Decrypt queries will be answered as in Phase 1, except that if the 
response would be either $M_0$ or $M_1$, then the challenger responds with the special message test instead. 


Guess. The adversary outputs a guess $b'$ of $b$. 


Figure 4: Generalized RCCA Security game for CP- and KP-ABE with outsourcing functionality. For CP-ABE we 
define the function $f(I_{key}, I_{enc})$ as $f(S, A)$ and for KP-ABE it is defined as $f(A, S)$. In either case the function $f$ 
evaluates to 1 iff $S \in A$. 
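
The bookkeeping that the challenger performs for the Phase 1 oracles can be sketched as follows. This is an illustration of the table and set manipulation only: `ToyScheme` is a hypothetical placeholder with no security whatsoever, the method names are assumptions, and the Challenge, Phase 2, and Guess stages are omitted.

```python
# Phase 1 oracle bookkeeping for the RCCA game (sketch under stated assumptions).

class ToyScheme:
    """Placeholder scheme; it exists only to make the sketch runnable."""

    def keygen_out(self, i_key):
        return ("SK", i_key), ("TK", i_key)   # (private key, transformation key)

    def decrypt(self, sk, ct):
        return ct.get("msg")                  # stand-in for real decryption


class Challenger:
    def __init__(self, scheme):
        self.scheme = scheme
        self.table = {}         # T: index j -> (I_key, SK, TK)
        self.corrupted = set()  # D: key inputs whose private key was revealed
        self.j = 0

    def create(self, i_key):
        """Create(I_key): generate a key pair, store it, reveal only TK."""
        self.j += 1
        sk, tk = self.scheme.keygen_out(i_key)
        self.table[self.j] = (i_key, sk, tk)
        return tk

    def corrupt(self, i):
        """Corrupt(i): reveal the SK of entry i and add its I_key to D."""
        if i not in self.table:
            return None         # models the error symbol
        i_key, sk, _ = self.table[i]
        self.corrupted.add(i_key)
        return sk

    def decrypt(self, i, ct):
        """Decrypt(i, CT): decrypt CT with the SK of entry i."""
        if i not in self.table:
            return None
        _, sk, _ = self.table[i]
        return self.scheme.decrypt(sk, ct)


ch = Challenger(ToyScheme())
tk1 = ch.create(frozenset({"Faculty"}))
tk2 = ch.create(frozenset({"Faculty"}))   # the same input may be queried again
sk1 = ch.corrupt(1)
print(ch.corrupt(7))                      # None: no such entry
```

Note that the adversary never sees SK unless it issues Corrupt, whereas TK is returned by every Create query; this models a proxy that holds transformation keys.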


2.3 Bilinear Maps 


Let $G$ and $G_T$ be two multiplicative cyclic groups of 
prime order $p$. Let $g$ be a generator of $G$ and $e : G \times G \to G_T$ 
be a bilinear map with the properties: 

1. Bilinearity: for all $u, v \in G$ and $a, b \in \mathbb{Z}_p$, we have 
$e(u^a, v^b) = e(u, v)^{ab}$. 

2. Non-degeneracy: $e(g, g) \neq 1$. 

We say that $G$ is a bilinear group if the group operation
in $G$ and the bilinear map $e : G \times G \to G_T$ are both 
efficiently computable. 

The schemes we present in this work are provably 
secure under the Decisional Parallel BDHE Assumption
[42] and the Decisional Bilinear Diffie-Hellman assumption
(DBDH) [9] in bilinear groups. For reasons 
of space we will omit a definition of these assumptions 
here, and refer the reader to the cited works. 


2.4 Linear Secret Sharing Schemes 


We will make essential use of linear secret-sharing 
schemes. We adapt our definitions from those in [5]: 

Definition 3 (Linear Secret-Sharing Schemes (LSSS)) 
A secret-sharing scheme $\Pi$ over a set of parties $\mathcal{P}$ is 
called linear (over $\mathbb{Z}_p$) if 

1. The shares of the parties form a vector over $\mathbb{Z}_p$. 

2. There exists a matrix $M$ with $\ell$ rows and $n$ columns 
called the share-generating matrix for $\Pi$. There exists
a function $\rho$ which maps each row of the matrix 
to an associated party. That is, for $i = 1, \ldots, \ell$ the 
value $\rho(i)$ is the party associated with row $i$. When 
we consider the column vector $v = (s, r_2, \ldots, r_n)$, 
where $s \in \mathbb{Z}_p$ is the secret to be shared, and 
$r_2, \ldots, r_n \in \mathbb{Z}_p$ are randomly chosen, then $Mv$ is the 
vector of $\ell$ shares of the secret $s$ according to $\Pi$. 
The share $(Mv)_i$ belongs to party $\rho(i)$. 

It is shown in [5] that every linear secret-sharing 
scheme according to the above definition also enjoys the 
linear reconstruction property, defined as follows: Suppose
that $\Pi$ is an LSSS for the access structure $A$. Let 
$S \in A$ be any authorized set, and let $I \subset \{1, 2, \ldots, \ell\}$ be 
defined as $I = \{i : \rho(i) \in S\}$. Then, there exist constants 
$\{\omega_i \in \mathbb{Z}_p\}_{i \in I}$ such that, if $\{\lambda_i\}$ are valid shares of any secret
$s$ according to $\Pi$, then $\sum_{i \in I} \omega_i \lambda_i = s$. It is shown 
in [5] that these constants $\{\omega_i\}$ can be found in time 
polynomial in the size of the share-generating matrix $M$. 
Like any secret sharing scheme, it has the property that 
for any unauthorized set $S \notin A$, the secret $s$ should be 
information theoretically hidden from the parties in $S$. 


Note on Convention. We use the convention that vector 
$(1, 0, 0, \ldots, 0)$ is the “target” vector for any linear secret 
sharing scheme. For any satisfying set of rows $I$ in $M$, 
we will have that the target vector is in the span of $I$. 

For any unauthorized set of rows $I$ the target vector is 
not in the span of the rows of the set $I$. Moreover, there 
will exist a vector $w$ such that $w \cdot (1, 0, 0, \ldots, 0) = -1$ and 
$w \cdot M_i = 0$ for all $i \in I$. 
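
Share generation, linear reconstruction, and the target-vector convention can be traced on a small example. The sketch below assumes a hand-built share-generating matrix for the policy "(A AND B) OR C" and a toy modulus; the matrix, the attribute names, and the reconstruction constants are illustrative choices, not taken from any cited construction.

```python
import secrets

p = 1019  # toy prime modulus (assumption); real schemes use a large group order

# Hand-built share-generating matrix for "(A AND B) OR C".
# Row i of M is associated with the party rho[i].
M = [
    [1, 1],    # A
    [0, -1],   # B
    [1, 0],    # C
]
rho = ["A", "B", "C"]


def dot(x, y):
    """Inner product modulo p."""
    return sum(a * b for a, b in zip(x, y)) % p


def make_shares(secret):
    """Compute the shares M*v for v = (s, r_2) with r_2 random in Z_p."""
    v = [secret % p, secrets.randbelow(p)]
    return [dot(row, v) for row in M]


def reconstruct(shares, omega):
    """Combine shares with constants omega = {row index: coefficient}."""
    return sum(w * shares[i] for i, w in omega.items()) % p


s = 424
shares = make_shares(s)

# {A, B} is authorized: rows 0 and 1 sum to the target vector (1, 0).
print(reconstruct(shares, {0: 1, 1: 1}) == s)   # True
# {C} is authorized on its own: row 2 equals the target vector.
print(reconstruct(shares, {2: 1}) == s)         # True

# {A} is unauthorized: w = (-1, 1) satisfies w.(1,0) = -1 and w.M_A = 0.
w = [-1, 1]
print(dot(w, [1, 0]) == p - 1, dot(w, M[0]) == 0)   # True True
```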


Using Access Trees. Some prior ABE works (e.g., [24]) described access formulas in terms of binary trees. Using standard techniques [5] one can convert any monotonic boolean formula into an LSSS representation. An access tree of $\ell$ nodes will result in an LSSS matrix of $\ell$ rows.
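
To make the conversion concrete, here is a minimal sketch of one commonly used formula-to-LSSS procedure (the one described by Lewko and Waters), which is not necessarily the technique of [5]. Formulas are nested tuples such as `("AND", left, right)` with attribute strings at the leaves; this representation and the name `formula_to_lsss` are assumptions of the example. The procedure produces one matrix row per leaf.

```python
def formula_to_lsss(formula):
    """Convert a monotone AND/OR formula into an LSSS matrix and its row labels."""
    rows, labels = [], []
    counter = 1  # number of matrix columns allocated so far

    def visit(node, vec):
        nonlocal counter
        if isinstance(node, str):          # leaf: emit one row labelled by the attribute
            rows.append(vec)
            labels.append(node)
            return
        op, left, right = node
        if op == "OR":                     # both children inherit the parent's vector
            visit(left, vec)
            visit(right, vec)
        elif op == "AND":                  # split the parent's vector across a fresh column
            padded = vec + [0] * (counter - len(vec))
            left_vec = padded + [1]
            right_vec = [0] * counter + [-1]
            counter += 1
            visit(left, left_vec)
            visit(right, right_vec)
        else:
            raise ValueError(f"unsupported gate: {op}")

    visit(formula, [1])
    matrix = [r + [0] * (counter - len(r)) for r in rows]  # pad every row to full width
    return matrix, labels

matrix, labels = formula_to_lsss(("AND", ("OR", "A", "B"), "C"))
```

For this policy the rows for "A" and "C" sum to the target vector $(1,0)$, so the set {A, C} is authorized with $\omega = (1,1)$.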


3 Outsourcing Decryption for Ciphertext- 
Policy ABE 


3.1 A CPA-secure Construction 


Our CP-ABE construction is based on the “large universe” construction of Waters [42], which was proven to be selectively CPA-secure under the Decisional q-parallel BDHE assumption for a challenge matrix of size $\ell^* \times n^*$, where $\ell^*, n^* \le q$.⁵ The Setup, Encrypt and (non-outsourced) Decrypt algorithms are identical to [42]. To enable outsourcing we modify the KeyGen algorithm to output a transformation key. We also define a new Transform algorithm, and modify the decryption algorithm to handle outputs of Encrypt as well as Transform. We present the full construction in Figure 5.


Discussion. For generality, we defined the transfor- 
mation key TK as being created by the master author- 
ity. However, we observe that our outsourcing approach 
above is actually backwards compatible with existing de- 
ployments of the Waters system. In particular, one can 
see that any existing user with her own Waters SK can 
create a corresponding outsourcing pair (SK’,TK’) by 
rerandomizing with a random value z. 
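
The rerandomization described above, together with the Transform and final decryption steps of Figure 5, can be sketched in a toy model. The code below is an assumption-laden illustration, not the libfenc implementation: elements of $\mathbb{G}$ are stored as their exponents, so the "pairing" is just a product of exponents, the group order is the tiny prime 1019, and the policy is a single attribute (matrix $M = (1)$, $\omega_1 = 1$). All function names are hypothetical, and the representation is completely insecure; it only checks the algebra.

```python
import hashlib
import secrets

Q = 1019            # toy prime order of G and G_T (the paper's p); far too small for real use
P_MOD = 2 * Q + 1   # 2039 is prime, so Z_2039^* has a subgroup of order Q
GT_GEN = 4          # generates that subgroup; plays the role of e(g, g)

def pair(x, y):
    # Toy pairing: g^x is stored as the exponent x, so e(g^x, g^y) = e(g, g)^(x*y).
    return pow(GT_GEN, (x * y) % Q, P_MOD)

def F(attr):
    # Hash an attribute string to an exponent, i.e. to the toy element g^F(attr).
    return int.from_bytes(hashlib.sha256(attr.encode()).digest(), "big") % Q

def rand_exp():
    return secrets.randbelow(Q - 1) + 1  # uniform in [1, Q-1]

def setup():
    alpha, a = rand_exp(), rand_exp()
    pk = {"egg_alpha": pow(GT_GEN, alpha, P_MOD), "g_a": a}  # toy PK: g^a stored as exponent a
    return pk, {"alpha": alpha, "a": a}

def keygen(msk, attr):
    # Standard key SK': K' = g^alpha g^(a t'), L' = g^t', K'_x = F(x)^t'.
    t = rand_exp()
    return {"K": (msk["alpha"] + msk["a"] * t) % Q, "L": t, "Kx": (F(attr) * t) % Q}

def to_outsourcing_pair(sk):
    # Rerandomize an existing key: raise every component to the power 1/z.
    z = rand_exp()
    z_inv = pow(z, -1, Q)  # modular inverse (Python 3.8+)
    return z, {name: (value * z_inv) % Q for name, value in sk.items()}

def encrypt(pk, message_gt, attr):
    # One-row policy, so the only share is lambda_1 = s.
    s, r = rand_exp(), rand_exp()
    c = (message_gt * pow(pk["egg_alpha"], s, P_MOD)) % P_MOD
    c1 = (pk["g_a"] * s - F(attr) * r) % Q  # C_1 = g^(a*lambda_1) * F(rho(1))^(-r_1)
    return {"C": c, "Cp": s, "C1": c1, "D1": r}

def transform(tk, ct):
    # Proxy side: e(C', K) / (e(C_1, L) * e(D_1, K_rho(1))) = e(g, g)^(s*alpha/z).
    num = pair(ct["Cp"], tk["K"])
    den = (pair(ct["C1"], tk["L"]) * pair(ct["D1"], tk["Kx"])) % P_MOD
    return ct["C"], (num * pow(den, -1, P_MOD)) % P_MOD

def decrypt_out(z, partial):
    # User side: a single exponentiation and division, T_0 / T_1^z.
    t0, t1 = partial
    return (t0 * pow(pow(t1, z, P_MOD), -1, P_MOD)) % P_MOD

pk, msk = setup()
z, tk = to_outsourcing_pair(keygen(msk, "DOCTOR"))
message = pow(GT_GEN, 77, P_MOD)
recovered = decrypt_out(z, transform(tk, encrypt(pk, message, "DOCTOR")))
```

In this sketch the proxy only ever sees `tk` and the ciphertext; the blinding exponent `z` stays with the user, which is what keeps the transformed value useless to the proxy.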


⁵ By “large universe”, we mean a system that allows for a super-polynomial number of attributes.


Theorem 3.1 Suppose the large universe construction of Waters [42, Appendix C] is a selectively CPA-secure CP-ABE scheme. Then the CP-ABE scheme of Figure 5 is a selectively CPA-secure outsourcing scheme.


Note that the Waters scheme of [42] was proven secure under the Decisional q-parallel BDHE assumption. Due to space constraints, we omit a proof of Theorem 3.1.
However, we observe that the proof techniques are quite 
similar to those used for the RCCA-secure variant we 
present in the next section. 


3.2 An RCCA-secure Construction


We now extend our CPA-secure system to achieve the 
stronger RCCA-security guarantee. To do so, we borrow 
some techniques from Fujisaki and Okamoto [18], who 
(roughly) showed how to transform a CPA-secure en- 
cryption scheme into a CCA-secure encryption scheme 
in the random oracle model. Here we relax to RCCA- 
security and have the additional challenge of preserving 
the decryption outsourcing capability. 

The Setup and KeyGen algorithms operate exactly as in the CPA-secure scheme, except the public key additionally includes the description of hash functions $H_1 : \{0,1\}^* \to \mathbb{Z}_p$ and $H_2 : \{0,1\}^* \to \{0,1\}^k$. We now describe the remaining algorithms.


Encrypt_rcca(PK, $\mathcal{M} \in \{0,1\}^k$, $(M,\rho)$). The encryption algorithm selects a random $R \in \mathbb{G}_T$ and then computes $s = H_1(R,\mathcal{M})$ and $r = H_2(R)$. It then computes $(C_1,D_1),\ldots,(C_\ell,D_\ell)$ as in the CPA-secure construction of Figure 5 (except that $s$ is no longer chosen randomly as part of $\vec{v}$). The ciphertext is published as CT =

$C = R \cdot e(g,g)^{\alpha s}, \quad C' = g^{s}, \quad C'' = \mathcal{M} \oplus r, \quad (C_1,D_1),\ldots,(C_\ell,D_\ell)$

along with a description of access structure $(M,\rho)$.


Transform_rcca(TK,CT). The transformation algorithm recovers the value $e(g,g)^{s\alpha/z}$ as before. It outputs the partially decrypted ciphertext CT′ as $(C, C'', e(g,g)^{s\alpha/z})$.


Decrypt_rcca(SK,CT). The decryption algorithm takes as input a private key SK = $(z, \mathrm{TK})$ and a ciphertext CT. If the ciphertext is not partially decrypted, then the algorithm first executes Transform_rcca(TK,CT). If the output is $\bot$, then this algorithm outputs $\bot$ as well. Otherwise, it takes the ciphertext $(T_0, T_1, T_2)$ and computes $R = T_0/T_2^{z}$, $\mathcal{M} = T_1 \oplus H_2(R)$, and $s = H_1(R,\mathcal{M})$. If $T_0 = R \cdot e(g,g)^{\alpha s}$ and $T_2 = e(g,g)^{\alpha s/z}$, it outputs $\mathcal{M}$; otherwise, it outputs the error symbol $\bot$.


Setup($\lambda$, U). The setup algorithm takes as input a security parameter and a universe description U. To cover the most general case, we let $U = \{0,1\}^*$. It then chooses a group $\mathbb{G}$ of prime order $p$, a generator $g$ and a hash function $F$ that maps $\{0,1\}^*$ to $\mathbb{G}$.ᵃ In addition, it chooses random exponents $\alpha, a \in \mathbb{Z}_p$. The authority sets MSK = $(g^{\alpha}, \mathrm{PK})$ as the master secret key. It publishes the public parameters as:


$\mathrm{PK} = g,\ e(g,g)^{\alpha},\ g^{a},\ F$


Encrypt(PK, $\mathcal{M}$, $(M,\rho)$). The encryption algorithm takes as input the public parameters PK and a message $\mathcal{M}$ to encrypt. In addition, it takes as input an LSSS access structure $(M,\rho)$. The function $\rho$ associates rows of $M$ to attributes. Let $M$ be an $\ell \times n$ matrix. The algorithm first chooses a random vector $\vec{v} = (s, y_2, \ldots, y_n) \in \mathbb{Z}_p^n$. These values will be used to share the encryption exponent $s$. For $i = 1$ to $\ell$, it calculates $\lambda_i = \vec{v} \cdot M_i$, where $M_i$ is the vector corresponding to the $i$th row of $M$. In addition, the algorithm chooses random $r_1, \ldots, r_\ell \in \mathbb{Z}_p$. The ciphertext is published as CT =


$C = \mathcal{M} \cdot e(g,g)^{\alpha s}, \quad C' = g^{s},$
$(C_1 = g^{a\lambda_1} \cdot F(\rho(1))^{-r_1},\ D_1 = g^{r_1}),\ \ldots,\ (C_\ell = g^{a\lambda_\ell} \cdot F(\rho(\ell))^{-r_\ell},\ D_\ell = g^{r_\ell})$


along with a description of $(M,\rho)$.


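KeyGen_out(MSK,S). The key generation algorithm runs KeyGen(MSK,S) to obtain SK′ = $(\mathrm{PK},\ K' = g^{\alpha} g^{a t'},\ L' = g^{t'},\ \{K'_x = F(x)^{t'}\}_{x \in S})$. It chooses a random value $z \in \mathbb{Z}_p^*$. It sets the transformation key TK as


$\mathrm{PK},\quad K = (K')^{1/z} = g^{\alpha/z} g^{a t'/z} = g^{\alpha/z} g^{a t},\quad L = (L')^{1/z} = g^{t'/z} = g^{t},\quad \{K_x\}_{x \in S} = \{(K'_x)^{1/z}\}_{x \in S}$


and the private key SK as $(z, \mathrm{TK})$.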


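Transform_out(TK,CT). The transformation algorithm takes as input a transformation key TK = $(\mathrm{PK}, K, L, \{K_x\}_{x \in S})$ for a set $S$ and a ciphertext CT = $(C, C', C_1, \ldots, C_\ell)$ for access structure $(M,\rho)$. If $S$ does not satisfy the access structure, it outputs $\bot$. Suppose that $S$ satisfies the access structure and let $I \subset \{1,2,\ldots,\ell\}$ be defined as $I = \{i : \rho(i) \in S\}$. Then, let $\{\omega_i \in \mathbb{Z}_p\}_{i \in I}$ be a set of constants such that if $\{\lambda_i\}$ are valid shares of any secret $s$ according to $M$, then $\sum_{i \in I} \omega_i \lambda_i = s$. The transformation algorithm computes


$e(C', K) \Big/ \Big( e\big(\prod_{i \in I} C_i^{\omega_i}, L\big) \cdot \prod_{i \in I} e\big(D_i^{\omega_i}, K_{\rho(i)}\big) \Big)$
$= e(g,g)^{s\alpha/z}\, e(g,g)^{a s t} \Big/ \Big( \prod_{i \in I} e(g,g)^{t a \lambda_i \omega_i} \Big) = e(g,g)^{s\alpha/z}$


It outputs the partially decrypted ciphertext CT′ as $(C, e(g,g)^{s\alpha/z})$, which can be viewed as the El Gamal ciphertext $(\mathcal{M} \cdot G^{zd}, G^{d})$ where $G = e(g,g)^{1/z} \in \mathbb{G}_T$ and $d = s\alpha \in \mathbb{Z}_p$.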


Decrypt_out(SK,CT). The decryption algorithm takes as input a private key SK = $(z, \mathrm{TK})$ and a ciphertext CT. If the ciphertext is not partially decrypted, then the algorithm first executes Transform_out(TK,CT). If the output is $\bot$, then this algorithm outputs $\bot$ as well. Otherwise, it takes the ciphertext $(T_0, T_1)$ and computes $T_0/T_1^{z} = \mathcal{M}$.
Notice that if the ciphertext is already partially decrypted for the user, then she need only compute one exponentiation and no pairings to recover the message.





ᵃ See Waters [42] for details on how to implement this hash in the standard model. For our purposes, one can think of F as a random oracle.


Figure 5: A CPA-secure CP-ABE outsourcing scheme based on the large-universe construction of Waters [42, Appendix C].


Theorem 3.2 Suppose the large universe construction of Waters [42, Appendix C] is a selectively CPA-secure CP-ABE scheme. Then the outsourcing scheme above is selectively RCCA-secure in the random oracle model for large message spaces.⁶


We present a proof of Theorem 3.2 in Appendix A. 
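
The consistency check performed by the RCCA decryption algorithm of Section 3.2 can be sketched as follows. This is a simplified illustration under loudly stated assumptions: the multiplicative group modulo the Mersenne prime $2^{127}-1$ stands in for $\mathbb{G}_T$ (its order is not prime, so the blinding exponent $z$ is sampled coprime to the group order), SHA-256 and SHAKE-256 stand in for $H_1$ and $H_2$, and `honest_transform_output` simulates Encrypt followed by an honest Transform directly, skipping the ABE layer. All names are hypothetical.

```python
import hashlib
import math
import secrets

P_MOD = 2**127 - 1   # a Mersenne prime; Z_P^* stands in for the target group G_T
N = P_MOD - 1        # order of Z_P^*; exponents are reduced modulo N

def H1(R, msg):
    digest = hashlib.sha256(b"H1" + R.to_bytes(16, "big") + msg).digest()
    return int.from_bytes(digest, "big") % N

def H2(R, k):
    return hashlib.shake_256(b"H2" + R.to_bytes(16, "big")).digest(k)

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def random_blinding_exponent():
    # z must be invertible modulo the (non-prime) group order used in this toy model.
    while True:
        z = secrets.randbelow(N - 1) + 1
        if math.gcd(z, N) == 1:
            return z

def honest_transform_output(egg_alpha, z, msg):
    # Stand-in for Encrypt_rcca followed by an honest Transform_rcca.
    R = secrets.randbelow(P_MOD - 1) + 1
    s = H1(R, msg)
    t0 = (R * pow(egg_alpha, s, P_MOD)) % P_MOD           # C   = R * e(g,g)^(alpha s)
    t1 = xor(msg, H2(R, len(msg)))                        # C'' = M xor H2(R)
    t2 = pow(egg_alpha, (s * pow(z, -1, N)) % N, P_MOD)   # e(g,g)^(alpha s / z)
    return t0, t1, t2

def decrypt_rcca(egg_alpha, z, partial):
    t0, t1, t2 = partial
    R = (t0 * pow(pow(t2, z, P_MOD), -1, P_MOD)) % P_MOD  # R = T_0 / T_2^z
    msg = xor(t1, H2(R, len(t1)))
    s = H1(R, msg)
    ok_c = t0 == (R * pow(egg_alpha, s, P_MOD)) % P_MOD
    ok_t = t2 == pow(egg_alpha, (s * pow(z, -1, N)) % N, P_MOD)
    return msg if (ok_c and ok_t) else None               # None plays the role of the error symbol

z = random_blinding_exponent()
egg_alpha = pow(3, 123456789, P_MOD)
partial = honest_transform_output(egg_alpha, z, b"sixteen byte msg")
plaintext = decrypt_rcca(egg_alpha, z, partial)
tampered = decrypt_rcca(egg_alpha, z, (partial[0], partial[1], (partial[2] * 3) % P_MOD))
```

An honestly transformed ciphertext decrypts to the message, whereas a proxy that alters the third component causes the recomputed values to disagree and decryption returns the error value.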


4 Outsourcing Decryption for Key-Policy 
ABE 


4.1 A CPA-secure Construction 


We now present an outsourcing scheme based on the 
large universe KP-ABE construction due to Goyal, 
Pandey, Sahai and Waters [24].⁷ The Setup and Encrypt
algorithms are identical to [24]. We modify KeyGen to 
output a transformation key, introduce a Transform algo- 
rithm, and then modify the decryption algorithm to han- 
dle outputs of Encrypt as well as Transform. The full 
construction is presented in Figure 6. 


Theorem 4.1 Suppose the GPSW KP-ABE scheme [24] 
is selectively CPA-secure. Then the KP-ABE scheme of 
Figure 6 is a selectively CPA-secure outsourcing scheme. 


Discussion. As in the previous construction, we defined 
the transformation key TK as being created by the master 
authority. We again note that our outsourcing approach 
above is actually backwards compatible with existing de- 
ployments of the GPSW system. 

Due to restrictions on space, we leave the proof of se- 
curity to the full version of this work [26]. 


4.2 An RCCA-secure construction 


We now extend our above results, which only hold for CPA-security, to the stronger RCCA-security guarantee. Once again, we accomplish this using the techniques from Fujisaki and Okamoto [18]. The Setup and KeyGen algorithms operate exactly as before, except the public key additionally includes the value $e(g,h)^{\alpha}$ (which was already computable from existing values) and the description of hash functions $H_1 : \{0,1\}^* \to \mathbb{Z}_p$ and $H_2 : \{0,1\}^* \to \{0,1\}^k$.


⁶ The security of this scheme follows for large message spaces; e.g., k-bit spaces where $k \ge \lambda$, the security parameter. To obtain a secure scheme for smaller message spaces, replace $C''$ with any CPA-secure symmetric encryption of $\mathcal{M}$ using key $H_2(R)$ and let the range of $H_2$ be the key space of this symmetric scheme. Since the focus of this work is on efficiency, we’ll typically be assuming large enough message spaces and therefore opting for the quicker XOR operation.

⁷ This construction was originally described using access trees; here we generalize it to LSSS access structures.


Encrypt_rcca(PK, $\mathcal{M} \in \{0,1\}^k$, $S$). The encryption algorithm chooses a random $R \in \mathbb{G}_T$. It then computes $s = H_1(R,\mathcal{M})$ and $r = H_2(R)$. For each $x \in S$ it generates $C_x$ as in the CPA-secure scheme. The ciphertext is published as CT =


$C = R \cdot e(g,h)^{\alpha s}, \quad C' = g^{s}, \quad C'' = r \oplus \mathcal{M}, \quad \{C_x\}_{x \in S}$
along with a description of $S$.


Transform_rcca(TK,CT). The transformation algorithm recovers the value $e(g,h)^{s\alpha/z}$ as before. It outputs the partially decrypted ciphertext CT′ as $(C, C'', e(g,h)^{s\alpha/z})$.


Decrypt_rcca(SK,CT). The decryption algorithm takes as input a private key SK = $(z, \mathrm{TK})$ and a ciphertext CT. If the ciphertext is not partially decrypted, then the algorithm first executes Transform_rcca(TK,CT). If the output is $\bot$, then this algorithm outputs $\bot$ as well. Otherwise, it takes the ciphertext $(T_0, T_1, T_2)$ and computes $R = T_0/T_2^{z}$, $\mathcal{M} = T_1 \oplus H_2(R)$, and $s = H_1(R,\mathcal{M})$. If $T_0 = R \cdot e(g,h)^{\alpha s}$ and $T_2 = e(g,h)^{\alpha s/z}$, it outputs $\mathcal{M}$; otherwise, it outputs the error symbol $\bot$.


Theorem 4.2 Suppose the construction of GPSW [24] 
is a selectively CPA-secure KP-ABE scheme. Then the 
outsourcing scheme above is selectively RCCA-secure in 
the random oracle model for large message spaces. 


See the footnote on Theorem 3.2 for a definition and dis- 
cussion of “large message spaces”. We present a proof 
of Theorem 4.2 in the full version [26] of this work. 


5 Discussion 


5.1 Achieving Adaptive Security 


The systems we presented were proven secure in the selective model of security. We briefly sketch how we can adapt our techniques to achieve ABE systems that are provably secure in the adaptive model.⁸

Recently, the first ABE systems that achieved adaptive security were proposed by Lewko et al. [28] using the techniques of Dual System Encryption [41]. Since the underlying structures of the KP-ABE and CP-ABE schemes presented by Lewko et al. are almost identical to those of the Goyal et al. [24] and Waters [42] systems we use, respectively, it is possible to adapt our construction techniques to these underlying constructions.⁹


⁸ We briefly note that it is simple to prove adaptive security of our schemes in the generic group model like Bethencourt, Sahai, and Waters [7]. Here we are interested in proofs under non-interactive assumptions.

⁹ The main difference in terms of the constructions is that the systems proposed by Lewko et al. are set in composite order groups where the “core scheme” sits in one subgroup. The primary novelty of their work is in developing adaptive proofs of security for ABE systems.


Setup($\lambda$, U). The setup algorithm takes as input a security parameter and a universe description U. To cover the most general case, we let $U = \{0,1\}^*$. It then chooses a group $\mathbb{G}$ of prime order $p$, a generator $g$ and a hash function $F$ that maps $\{0,1\}^*$ to $\mathbb{G}$.ᵃ In addition, it chooses random values $\alpha \in \mathbb{Z}_p$ and $h \in \mathbb{G}$. The authority sets MSK = $(\alpha, \mathrm{PK})$ as the master secret key. The public key is published as

$\mathrm{PK} = g,\ g^{\alpha},\ h,\ F$


Encrypt(PK, $\mathcal{M}$, $S$). The encryption algorithm takes as input the public parameters PK, a message $\mathcal{M}$ to encrypt, and a set of attributes $S$. It chooses a random $s \in \mathbb{Z}_p$. The ciphertext is published as CT = $(S, C)$ where


$C = \mathcal{M} \cdot e(g,h)^{\alpha s}, \quad C' = g^{s}, \quad \{C_x = F(x)^{s}\}_{x \in S}.$


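KeyGen_out(MSK, $(M,\rho)$). Parse MSK = $(\alpha, \mathrm{PK})$. The key generation algorithm runs KeyGen($(\alpha, \mathrm{PK})$, $(M,\rho)$) to obtain SK′ = $(\mathrm{PK},\ (D'_1 = h^{\lambda_1} \cdot F(\rho(1))^{r'_1},\ R'_1 = g^{r'_1}),\ \ldots,\ (D'_\ell, R'_\ell))$. Next, it chooses a random value $z \in \mathbb{Z}_p^*$, computes the transformation key TK as below, and outputs the private key as $(z, \mathrm{TK})$. Denoting $r'_i/z$ as $r_i$, TK is computed as:


$\mathrm{PK},\ (D_1 = (D'_1)^{1/z} = h^{\lambda_1/z} \cdot F(\rho(1))^{r_1},\ R_1 = (R'_1)^{1/z} = g^{r_1}),\ \ldots,\ (D_\ell = (D'_\ell)^{1/z},\ R_\ell = (R'_\ell)^{1/z})$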


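Transform_out(TK,CT). The transformation algorithm takes as input a transformation key TK = $(\mathrm{PK}, (D_1,R_1), \ldots, (D_\ell,R_\ell))$ for access structure $(M,\rho)$ and a ciphertext CT = $(C, C', \{C_x\}_{x \in S})$ for set $S$. If $S$ does not satisfy the access structure, it outputs $\bot$. Suppose that $S$ satisfies the access structure and let $I \subset \{1,2,\ldots,\ell\}$ be defined as $I = \{i : \rho(i) \in S\}$. Then, let $\{\omega_i \in \mathbb{Z}_p\}_{i \in I}$ be a set of constants such that if $\{\lambda_i\}$ are valid shares of any secret $s$ according to $M$, then $\sum_{i \in I} \omega_i \lambda_i = s$. The transformation algorithm computes


$e\big(C', \prod_{i \in I} D_i^{\omega_i}\big) \Big/ \Big( \prod_{i \in I} e\big(R_i, C_{\rho(i)}^{\omega_i}\big) \Big) = e\big(g^{s}, \prod_{i \in I} h^{\lambda_i \omega_i / z} \cdot F(\rho(i))^{r_i \omega_i}\big) \Big/ \Big( \prod_{i \in I} e\big(g^{r_i}, F(\rho(i))^{s \omega_i}\big) \Big)$

$= e(g,h)^{s\alpha/z} \cdot \prod_{i \in I} e\big(g^{s}, F(\rho(i))^{r_i \omega_i}\big) \Big/ \Big( \prod_{i \in I} e\big(g^{r_i}, F(\rho(i))^{s \omega_i}\big) \Big) = e(g,h)^{s\alpha/z}$


It outputs the partially decrypted ciphertext CT′ as $(C, e(g,h)^{s\alpha/z})$, which can be viewed as the El Gamal ciphertext $(\mathcal{M} \cdot G^{zd}, G^{d})$ where $G = e(g,h)^{1/z} \in \mathbb{G}_T$ and $d = s\alpha \in \mathbb{Z}_p$.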


Decrypt_out(SK,CT). The decryption algorithm takes as input a private key SK = $(z, \mathrm{TK})$ and a ciphertext CT. If the ciphertext is not partially decrypted, then the algorithm first executes Transform_out(TK,CT). If the output is $\bot$, then this algorithm outputs $\bot$ as well. Otherwise, it takes the ciphertext $(T_0, T_1)$ and computes $T_0/T_1^{z} = \mathcal{M}$.





ᵃ Goyal et al. [24] give a standard model instantiation for F using an n-wise independent hash function (in the exponents) with the restriction that any ciphertext can contain at most n attributes. For our purposes, one can think of F as a random oracle.


Figure 6: A CPA-secure KP-ABE outsourcing scheme based on the large-universe construction of Goyal, Pandey, Sahai and Waters [24].


Figure 7: Architecture and data flow for our cloud-based outsourcing proxy. An application programmatically instantiates one or more instances of the outsourcing proxy, which is loaded from a public Amazon Machine Image (AMI) in the S3 storage cloud. Next the application uploads a transform key TK to the proxy, and subsequently instructs the proxy to obtain ciphertexts from remote web servers or from locations within the S3 storage cloud. The proxy transforms the ciphertexts and returns the partially-decrypted result to the application, which completes decryption to obtain a plaintext. We emphasize that the setup step, including uploading the transformation key, only needs to be done once; subsequently, many decryption steps can follow. In an alternative configuration (not shown) the application can also upload ABE ciphertexts to the proxy from its local storage. We note the first configuration conflates the ciphertext delivery and partial decryption and thus requires no additional transmissions relative to non-outsourcing solutions. The alternative will require a round trip for each outsourcing operation.


One might hope that the proof of adaptive security 
could be a black box reduction to the adaptively secure 
schemes of Lewko et al. Unfortunately, this seems infeasible. Consider any direct black box reduction to the
security of the underlying scheme. When the attacker 
makes a query to some transformation key, the reduction 
algorithm has two options. First, it could ask the security 
game for the underlying ABE system for a private key. 
Yet, it might turn out that the key both is never corrupted 
and is capable of decryption for the eventual challenge 
ciphertext. In this case the simulator will have to abort. 
A second option is for the reduction algorithm not to ask 
for such a key, but fill in the transformation key itself. 
However, if that user’s key is later corrupted it will be 
difficult for the reduction to both ask for such a private 
key and match it to the published transformation key. 


Accordingly, to prove security one needs to make a 
direct Dual-System encryption type proof. The proof 
would go along the lines of Lewko et al., with the exception that in the hybrid stage of the proof all private keys and transformation keys will be set (one by one) to be semi-functional, including those that could decrypt the eventual challenge ciphertext. In the Lewko et al.
proof giving a private key that could decrypt the chal- 
lenge ciphertext would undesirably result in the sim- 
ulator producing observably incorrect correlations be- 
tween the challenge ciphertext and keys. However, if 
we only give out the transformation part of such a key 
(and keep the whole private key hidden) then this cor- 
relation will remain hidden. This part of the argument 
is somewhat similar to the work of Lewko, Rouselakis, 
and Waters [29], who show that in their leakage resilient 
ABE scheme if only part of a private key is leaked such 
a correlation will be hidden.


5.2 Checking the Transformation


In the description of our systems a proxy will be able 
to transform any ABE ciphertext into a short ciphertext 
for the user. While the security definitions show that an 
attacker will not be able to learn an encrypted message, 
there is no guarantee on the transformation’s correctness. 
In some applications a user might want to request the 
transformation of a particular ciphertext and (efficiently) 
check that the transformation was indeed done correctly 
(assuming the original ciphertext was valid). It is easy to 
adapt our RCCA systems to such a setting. Since decryp- 
tion results in recovery of the ciphertext randomness, one 
can simply add a tag to the ciphertext as H’(r), where H’ 
is a different hash function modeled as a random oracle 
and r is the ciphertext randomness. On recovery of r the 
user can compute H'(r) and make sure it matches the tag. 
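
The tag check described above can be sketched in a few lines. In this illustration SHA-256 with a domain-separation prefix stands in for the random oracle $H'$, the randomness $r$ is assumed to be available as a byte string, and the function names are hypothetical.

```python
import hashlib
import hmac

def make_tag(r: bytes) -> bytes:
    # Encryptor side: tag = H'(r), where r is the ciphertext randomness.
    return hashlib.sha256(b"H-prime|" + r).digest()

def transformation_checks_out(recovered_r: bytes, tag: bytes) -> bool:
    # User side: recompute H'(r) from the recovered randomness and compare with the tag.
    return hmac.compare_digest(make_tag(recovered_r), tag)

tag = make_tag(b"ciphertext randomness")
accepted = transformation_checks_out(b"ciphertext randomness", tag)
rejected = transformation_checks_out(b"randomness from a bad transform", tag)
```

If the proxy returns an incorrectly transformed ciphertext, the user recovers different randomness and the comparison fails.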


6 Performance in Practice 


To validate our results, we implemented the CPA-secure CP-ABE of Section 3 as an extension to the libfenc Attribute Based Encryption library [25]. We then used this as a building block for a platform for accelerating ABE decryption through cloud-based computing resources. The core of our solution is a virtualized outsourcing “proxy” that runs in the Amazon Elastic Compute Cloud (EC2). Our proxy exists as a machine image that can be programmatically instantiated by any application that requires assistance with ABE decryption. As we demonstrate below, this proxy is particularly useful for accelerating decryption on constrained devices such as mobile phones. However, the system can be used in any application where significant numbers of ABE decryptions must be performed, e.g., in large-scale search operations.¹⁰ The use of on-demand computing is particularly well-suited to our outsourcing techniques, since we do not require trusted remote servers or long-term storage of secrets.


System Architecture. Figure 7 illustrates the architecture of our outsourcing platform. The proxy is stored in Amazon’s S3 datastore as a public Amazon Machine Image (AMI), which wraps a standard Linux/Apache distribution along with the code needed to execute the Transform algorithm. Applications can remotely instantiate the proxy and upload a TK corresponding to a particular ABE decryption key.¹¹ Depending on the use case, they can either push ciphertexts to the proxy for transformation, or direct the proxy to retrieve ABE ciphertexts from remote locations such as the web or the Amazon S3 storage cloud. The latter technique is helpful when accessing remotely-held records on a mobile device, since the proxy transformation dramatically reduces the mobile device’s bandwidth requirements vs. downloading and decrypting each ABE ciphertext locally. This can significantly enhance device battery life.


6.1 Performance: Microbenchmarks 


To evaluate the performance of our CPA-secure CP-ABE outsourcing scheme in isolation (without confounding factors such as network lag, file I/O, etc.) we conducted a series of microbenchmarks using the libfenc implementation. For consistency, we ran these tests on two dedicated hardware platforms: a 3GHz Intel Core Duo platform with 4GB of RAM running 32-bit Linux Kernel version 2.6.32, and a 412MHz ARM-based iPhone 3G with 128MB of RAM running iOS 4.0.¹² We instantiated the ABE schemes using a 224-bit MNT elliptic curve from the Stanford Pairing-Based Crypto library [30].¹³ The existing libfenc implementation implements the Waters scheme using a Key Encapsulation variant. For backwards compatibility, we adopted this approach in our implementation as well. Herein, the ciphertext carries a symmetric session key $k$ that is computed at encryption time as $k = H(e(g,g)^{\alpha s})$. The element $C = \mathcal{M} \cdot e(g,g)^{\alpha s}$ is omitted from the ciphertext, and any data payload must be carried via a separate symmetric encryption under $k$. The practical impact of this approach is that the ABE ciphertexts (and partially-decrypted ciphertexts) are shortened by one element of $\mathbb{G}_T$.


¹⁰ Indeed, since cloud computing platforms support the creation of multiple proxy instances, servers can rapidly scale their outsourcing capability up and down to meet demand.

¹¹ The proxy requires only one TK to decrypt an unlimited number of ciphertexts. However, a proxy can be shared by multiple users, each with their own TK.

¹² Note that our tests were single-threaded, and thus used resources from only a single core of the Intel processor. In all cases we conducted our timing experiments with accessible background services disabled, and with the mobile device connected to a power source.

¹³ Although we define our schemes in the symmetric bilinear group setting, the MNT curve choice required that we implement the scheme in asymmetric groups with a pairing of the form $\mathbb{G}_1 \times \mathbb{G}_2 \to \mathbb{G}_T$. As a result we assigned various elements of the ciphertext and key to the groups $\mathbb{G}_1$ and $\mathbb{G}_2$ with the aim of minimizing ciphertext size.


Experimental setup. Both decryption time and ciphertext size in the CP-ABE scheme depend on the complexity of the ciphertext’s policy. To capture this in our experiments, we first generated a collection of 100 distinct ciphertext policies of the form ($A_1$ AND $A_2$ AND ... AND $A_N$), where each $A_i$ is an attribute, for values of $N$ increasing from 1 to 100. In each case we constructed a corresponding decryption key that contained the $N$ attributes necessary for decryption. This approach ensures that the decryption procedure depends on all $N$ components of the ciphertext and is a reasonable sample of a complex policy.

To obtain our baseline results, we encapsulated a random 128-bit symmetric key under each of these 100 different policies, then decrypted the resulting ABE ciphertext using the normal (non-outsourced) Decrypt algorithm.¹⁴ To smooth any experimental variability, we repeated each of our experiments 100 times on the Intel device (due to the time consuming nature of the experiments, we repeated the test only 30 times on the ARM device) and averaged to obtain our decryption timings. Figure 8 shows the size of the resulting ciphertexts as a function of $N$, along with the measured decryption times on our Intel and ARM test platforms.

Next, we evaluated the algorithms by generating a Transform Key (TK) from the appropriate N-attribute ABE decryption key and applying the Transform algorithm to the ABE ciphertext using this key.¹⁵ Finally we decrypted the resulting transformed ciphertext. Figure 8 shows the time required for each of those operations.


Discussion. As expected, the ABE ciphertext size and decryption/transform time were linear in the complexity of the ciphertext’s policy (N). However, our results illustrate the surprisingly high constants. Encrypting under a 100-component ciphertext policy produced an unwieldy 25KB of ABE ciphertext. The relatively fast Intel processor required nearly 2 full seconds to decrypt this value. By comparison, the same machine can perform a 1024-bit RSA decryption in 1.7 milliseconds.¹⁶

The results were more dramatic on the mobile device. Decrypting a 100-component ciphertext policy on the ARM processor required nearly 30 seconds of sustained computation. Even at lower policy complexities, our results seem problematic for implementers looking to deploy unassisted ABE on limited computing devices.

Outsourcing substantially reduced both ciphertext size and the time needed to decrypt the partially-decrypted ciphertext. Each partially-decrypted ciphertext was a fixed 188 bytes in size, regardless of the original ciphertext’s CP-ABE policy. Furthermore, the final decryption process required only 4ms on the Intel processor and a manageable 60ms on ARM.¹⁷ Thus, it appears that outsourcing can provide a noticeable decryption time advantage for ciphertexts with 10 or more attributes.


¹⁴ Note that for this experiment we did not employ any symmetric encryption, hence all times and ciphertext sizes refer to the ABE key encapsulation ciphertext.

¹⁵ We used the “backwards-compatible” key generation approach described in Section 3.1 to derive a TK from a standard ABE decryption key, rather than having the PKG generate the TK directly. This allowed us to retain compatibility with the existing CP-ABE implementation.

¹⁶ Measured with OpenSSL 1.0 [40].


[Figure 8 plots omitted. Panels: ABE Ciphertext Size; Partially-decrypted Ciphertext Size; ABE Decryption Time; Outsourcing Keygen (Time); Transform (Time); Final Decryption (Time). Horizontal axes: number of policy attributes (or policy leaves, or key attributes) N, from 0 to 100.]

Figure 8: Microbenchmark results for our CP-ABE scheme with outsourcing. Timing results are provided for both Intel and ARM platforms. Key generation times represent the time to convert a standard ABE decryption key into an outsourcing key, using the “backwards-compatible” approach described in Section 3.1. “Final decryption” refers to the decryption of a partially-decrypted ciphertext. Note that we present the Transform timing results for the Intel platform only, since we view this as the more likely outsourcing platform. Intel (resp. ARM) timings represent the average of 100 (resp. 30) test iterations.


Other Implementation Remarks. There are several optimizations and tradeoffs one might explore that could impact both the performance of the existing ABE scheme and our outsourced scheme. We chose to use the PBC library due to its use in the libfenc system and its simple API. However, PBC does not include all of the latest optimizations discussed in the research literature. Other future optimizations could include the use of multi-pairings for decryption. We emphasize that while using such optimizations to the existing ABE systems could give some performance improvements, they will not improve the size of ABE ciphertexts. Furthermore, decryption time will still be linear in the size of the satisfied formula, whereas our outsourcing technique transforms the final decryption step to a short El-Gamal-type ciphertext.


¹⁷ We conducted our experiments on the CPA-secure version of our scheme. The primary performance differences in the RCCA version are an extra exponentiation in $\mathbb{G}_T$ and some additional bytes.


A note on policy complexity. The reader might assume 
that 50- or 100-component policies are rare in practice. 
In fact, we observed that it is relatively easy to arrive 
at highly complex policies in typical use cases. This is 
particularly true when using policies that contain integer 
comparison operators, e.g., “AGE < 30”. The libfenc li- 
brary implements integer comparison operators using the 
technique of Bethencourt et al. [7]: prior to encryption, 
each comparison operator is converted into a boolean 
policy circuit composed of OR and AND gates, and the 
resulting policy is applied to the ciphertext. Comparing 
an attribute to a fixed n-bit integer adds approximately 
n components to the policy. For example, without special optimizations, a restriction window involving a Unix time value (x < KEY_CREATION_TIME < y) increases the policy size by approximately 64 components.
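
To see why a comparison against an n-bit constant costs roughly n policy components, consider the following sketch. It follows the general bit-decomposition idea behind the technique of Bethencourt et al. [7], but it is not libfenc's actual encoding: the attribute naming scheme (`NAME:bit3=0`), the tuple-based policy representation and all function names are assumptions made for illustration.

```python
def less_than_policy(name, c, n_bits):
    """Monotone AND/OR formula for `name < c` over n_bits-bit values.

    Leaves are attributes of the form 'name:bit<i>=0'. Returns None if the
    comparison can never hold (c == 0).
    """
    policy = None                            # None stands for the constant False
    for i in range(n_bits):                  # least significant bit first
        leaf = f"{name}:bit{i}=0"
        if (c >> i) & 1:
            # Bit i of c is 1: a zero bit here already makes the value smaller.
            policy = leaf if policy is None else ("OR", leaf, policy)
        elif policy is not None:
            # Bit i of c is 0: the value needs a zero here AND smaller low bits.
            policy = ("AND", leaf, policy)
    return policy

def attributes_for(name, value, n_bits):
    # The per-bit attributes a key holder with this value would be issued.
    return {f"{name}:bit{i}={(value >> i) & 1}" for i in range(n_bits)}

def evaluate(policy, attrs):
    if policy is None:
        return False
    if isinstance(policy, str):
        return policy in attrs
    op, left, right = policy
    if op == "OR":
        return evaluate(left, attrs) or evaluate(right, attrs)
    return evaluate(left, attrs) and evaluate(right, attrs)

def count_leaves(policy):
    if policy is None:
        return 0
    if isinstance(policy, str):
        return 1
    return count_leaves(policy[1]) + count_leaves(policy[2])

upper = less_than_policy("TIME", 1267423200, 32)
leaves = count_leaves(upper)                 # at most one leaf per bit, so at most 32
```

Each bit of the constant contributes at most one leaf, so a two-sided window over 32-bit values yields at most about 64 leaves, in line with the estimate in the text.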













Operation                                                 local-only   local+web      proxy         proxy+web
                                                          (sec)        (sec/kb)       (sec/kb)      (sec/kb)
New proxy instantiation                                   -            -              93.4 sec      93.4 sec
Restart existing proxy instance                           -            -              45 sec        45 sec
Generate & set 70-element transform key                   -            -              2.9 sec       2.9 sec
Decryption:
  ((DOCTOR OR NURSE) AND INSTITUTION)                     1.1s         1.2s/1.1k      .2s/1.4k      .2s/0.4k
  (DOCTOR AND TIME > 1262325600 AND TIME < 1267423200)    17.3s        17.3s/22.8k    1.2s/23.2k    1.2s/0.4k





Figure 9: Some average performance results for the proxy-enhanced iHealthEHR application running on our iPhone 
3G. From left to right, “local-only” indicates device-local decryption and storage of ciphertexts, “local+web” indicates 
that ciphertexts were downloaded from a web server and decrypted at the device. “proxy” indicates local ciphertext 
storage with proxy outsourcing. “proxy+web” indicates that ciphertexts were obtained from the web via the proxy. 
Where relevant we provide both timings and total bandwidth transferred (up+down) from the device. Note that proxy 
launch times exhibit some variability depending on factors outside of our control. 


6.2 Performance: Mobile Example 


To validate our ideas in a real application, we incorpo- 
rated outsourcing into the iPhone viewer component of 
iHealthEHR [3], an experimental system for distributing 
Electronic Health Records (EHRs). Since EHRs can con- 
tain highly sensitive data, iHealthEHR uses CP-ABE to 
perform end-to-end encryption of records from the orig- 
ination point to the viewing device. Distinct ciphertext 
policies may be applied to each node in an individual’s 
health record (e.g., to admit special permissions for psy- 
chiatric records). iHealthEHR supports both local and 
cloud-based storage of records. 

We modified the iPhone application to remotely 
instantiate our outsourcing proxy on startup, using 
a “small” server instance within Amazon’s storage 
cloud.!® In our experiments we found that the first EC2 
instantiation required anywhere from 1-3 minutes, pre- 
sumably depending on the system’s load. However, once 
the proxy was launched, it could be left running indefi- 
nitely and shared by many different users with different 
TKs, or — when not in use — paused and brought back 
to full operation in as little as 30 seconds (with an av- 
erage closer to 45 seconds). During this startup interval 
we set the application to locally process all decryption 
operations. Once the proxy signaled its availability, the 
application pushed a TK to it via HTTP, and outsourced 
all further decryption operations. 
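
The startup behaviour described above (decrypt locally until the proxy signals availability, then push a TK and outsource) can be sketched as follows. This is a minimal illustration, not the iHealthEHR code: the class name and the three callables are hypothetical placeholders for the device-local decryption, the proxy's Transform step, and the lightweight final decryption.

```python
class OutsourcingClient:
    """Decrypt locally until the proxy is ready, then outsource (hypothetical names)."""

    def __init__(self, decrypt_local, transform_at_proxy, decrypt_out):
        self.decrypt_local = decrypt_local            # full ABE decryption on the device
        self.transform_at_proxy = transform_at_proxy  # remote Transform using the TK
        self.decrypt_out = decrypt_out                # cheap final step on the device
        self.proxy_ready = False

    def on_proxy_available(self, push_transform_key):
        # Called once the proxy signals availability; the TK is pushed first.
        push_transform_key()
        self.proxy_ready = True

    def decrypt(self, ciphertext):
        if not self.proxy_ready:
            # Startup interval: process everything on the device.
            return self.decrypt_local(ciphertext)
        # Afterwards: proxy transforms, device finishes the decryption.
        return self.decrypt_out(self.transform_at_proxy(ciphertext))
```

The point of the design is that the user never waits on the proxy launch: the same `decrypt` call works before and after the proxy becomes available.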

To evaluate the performance implications, we con- 
ducted experiments on the system with outsourcing en- 
abled and disabled, considering four likely usage sce- 
narios. In the first scenario (local-only), we conducted 
device-local decryption on ciphertexts stored locally in 
the device’s Flash memory. In the second scenario (lo- 
cal+web) we downloaded ciphertexts from a web server, 


18 According to Amazon’s documentation, a small EC2 instance pro- 
vides “the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 
2007 Xeon processor” and 1.7GB of RAM, at a cost of USD $0.085/hr. 
[1]. 




then decrypted them locally at the device. In the third 
scenario (proxy), we stored ciphertexts locally and then 
uploaded them to the proxy for transformation. In the 
final scenario (proxy+web) ciphertexts were retrieved 
from a web server by the proxy, then Transformed be- 
fore being sent to the device. In each case we measured 
the time required to decrypt, along with the total band- 
width transmitted and received by the device (excepting 
the local-only case, which did not employ the network 
connection). The results are summarized in Figure 9. 


7 Hardening ABE Implementations 


Thus far we have described outsourcing solely as a means to 
improve decryption performance. In certain cases out- 
sourcing can also be used to enhance security. By way 
of motivation, we observe that ABE implementations 
tend to be relatively complex compared to implementa- 
tions of other public-key encryption schemes. For ex- 
ample, libfenc’s policy handling components alone com- 
prise nearly 3,000 lines of C code, excluding library de- 
pendencies. It has been observed that the number of vul- 
nerabilities in a software product tends to increase in pro- 
portion to the code’s complexity [34]. 

It is common for designers to mitigate software issues 
by sandboxing vulnerable processes (e.g., [33]), or through 
techniques that isolate security-sensitive functions within 
a process [32]. McCune et al. recently proposed TrustVi- 
sor [31], a specialized hypervisor designed to protect and 
isolate security-sensitive “Pieces of Application Logic” 
(PALs) from less sensitive code. 

We propose outsourcing as a tool to harden ABE im- 
plementations in platforms with code isolation. For ex- 
ample, in a system equipped with TrustVisor, imple- 
menters can embed the relatively simple key generation 
and Decrypt_out routines in security-sensitive code (e.g., 
a TrustVisor PAL) and use outsourcing to push the re- 
maining calculations into non-sensitive code. This not 
only reduces the size of the sensitive code base, it also 
simplifies parameter validation for the PAL (since the 
partially-decrypted ABE ciphertext is substantially less 
complex than the original). We refer to this technique 
as “self-outsourcing” and note that it can also be used 
in systems containing hardware security modules (e.g., 
cryptographic smart cards). Moreover, based on our ex- 
periments of Section 6, we estimate that this approach 
will have a minimal impact on performance. 
A. Proof of Theorem 3.2 


Proof. Suppose there exists a polynomial-time adversary 
$\mathcal{A}$ that can attack our scheme in the selective RCCA-security 
model for outsourcing with advantage $\epsilon$. We 
build a simulator $\mathcal{B}$ that can attack the Waters scheme 
of [42, Appendix C] in the selective CPA-security model 
with advantage $\epsilon$ minus a negligible amount. In [42] the 
Waters scheme is proven secure under the decisional 
q-parallel BDHE assumption. 

Init. The simulator $\mathcal{B}$ runs $\mathcal{A}$. $\mathcal{A}$ chooses the 
challenge access structure $(M^*, \rho^*)$, which $\mathcal{B}$ passes on to 
the Waters challenger as the structure on which it wishes 
to be challenged. 




Setup. The simulator $\mathcal{B}$ obtains the Waters public 
parameters $PK = g, e(g,g)^{\alpha}, g^{a}$ and a description of the 
hash function $F$. It sends these to $\mathcal{A}$ as the public 
parameters. 


Phase 1. The simulator $\mathcal{B}$ initializes empty tables 
$T, T_1, T_2$, an empty set $D$ and an integer $j = 0$. It answers 
the adversary’s queries as follows: 


• Random Oracle Hash $H_1(R, \mathcal{M})$: If there is an entry 
$(R, \mathcal{M}, s)$ in $T_1$, return $s$. Otherwise, choose a 
random $s \in \mathbb{Z}_p$, record $(R, \mathcal{M}, s)$ in $T_1$ and return $s$. 

• Random Oracle Hash $H_2(R)$: If there is an entry 
$(R, r)$ in $T_2$, return $r$. Otherwise, choose a random 
$r \in \{0,1\}^k$, record $(R, r)$ in $T_2$ and return $r$. 

• Create($S$): $\mathcal{B}$ sets $j := j + 1$. It now proceeds one 
of two ways. 


– If $S$ satisfies $(M^*, \rho^*)$, then it chooses a “fake” 
transformation key as follows: choose a random 
$d \in \mathbb{Z}_p$ and run KeyGen$((d, PK), S)$ to 
obtain $SK'$. Set $TK = SK'$ and set $SK = (d, TK)$. 
Note that the pair $(d, TK)$ is not well-formed, 
but that $TK$ is properly distributed if $d$ 
was replaced by the unknown value $z = \alpha/d$. 

– Otherwise, it calls the Waters key generation 
oracle on $S$ to obtain the key 
$SK' = (PK, K', L', \{K'_x\}_{x \in S})$. (Recall that in the 
non-outsourcing CP-ABE game, the Create 
and Corrupt functionalities are combined in 
one oracle.) The algorithm chooses a random 
value $z \in \mathbb{Z}_p$ and sets the transformation 
key $TK$ as $(PK, K = K'^{1/z}, L = L'^{1/z}, 
\{K_x\}_{x \in S} = \{K'^{1/z}_x\}_{x \in S})$ and the private 
key as $(z, TK)$. 


Finally, store $(j, S, SK, TK)$ in table $T$ and return $TK$ 
to $\mathcal{A}$. 

• Corrupt($i$): $\mathcal{A}$ cannot ask to corrupt any key 
corresponding to the challenge structure $(M^*, \rho^*)$. If 
there exists an $i$th entry in table $T$, then $\mathcal{B}$ obtains 
the entry $(i, S, SK, TK)$ and sets $D := D \cup \{S\}$. It 
then returns $SK$ to $\mathcal{A}$, or $\perp$ if no such entry exists. 

• Decrypt($i$, CT): Without loss of generality, we assume 
that all ciphertexts input to this oracle are already 
partially decrypted. Recall that both $\mathcal{B}$ and 
$\mathcal{A}$ have access to the $TK$ values for all keys created, 
so either can execute the transformation operation. 
Let $CT = (C_0, C_1, C_2)$ be associated with structure 
$(M, \rho)$. Obtain the record $(i, S, SK, TK)$ from table 
$T$. If it is not there or $S \notin (M, \rho)$, return $\perp$ to $\mathcal{A}$. 

If key $i$ does not satisfy the challenge structure 
$(M^*, \rho^*)$, proceed as follows: 


1. Parse $SK = (z, TK)$. Compute $R = C_0 / C_2^{z}$. 
2. Obtain the records $(R, \mathcal{M}_i, s_i)$ from table $T_1$. If 
none exist, return $\perp$ to $\mathcal{A}$. 
3. If in this set, there exist indices $y \neq x$ such 
that $(R, \mathcal{M}_y, s_y)$ and $(R, \mathcal{M}_x, s_x)$ are in table $T_1$, 
$\mathcal{M}_y \neq \mathcal{M}_x$ and $s_y = s_x$, then $\mathcal{B}$ aborts the 
simulation. 

4. Otherwise, obtain the record $(R, r)$ from table 
$T_2$. If it does not exist, $\mathcal{B}$ outputs $\perp$. 

5. For each $i$, test if $C_0 = R \cdot e(g,g)^{\alpha s_i}$, $C_1 = \mathcal{M}_i \oplus r$ 
and $C_2 = e(g,g)^{\alpha s_i / z}$. 

6. If there is an $i$ that passes the above test, output 
the message $\mathcal{M}_i$; otherwise, output $\perp$. (Note: 
at most one value of $s_i$, and thereby one index 
$i$, can satisfy the third check of the above test.) 


If key $i$ does satisfy the challenge structure 
$(M^*, \rho^*)$, proceed as follows: 


1. Parse $SK = (d, TK)$. Compute $\beta = C_2^{1/d}$. 
2. For each record $(R_i, \mathcal{M}_i, s_i)$ in table $T_1$, test if 
$\beta = e(g,g)^{s_i}$. 

3. If zero matches are found, $\mathcal{B}$ outputs $\perp$ to $\mathcal{A}$. 

4. If more than one match is found, $\mathcal{B}$ aborts 
the simulation. 

5. Otherwise, let $(R, \mathcal{M}, s)$ be the sole match. 
Obtain the record $(R, r)$ from table $T_2$. If it 
does not exist, $\mathcal{B}$ outputs $\perp$. 

6. Test if $C_0 = R \cdot e(g,g)^{\alpha s}$, $C_1 = \mathcal{M} \oplus r$ and 
$C_2 = e(g,g)^{d s}$. 

7. If all tests pass, output $\mathcal{M}$; else, output $\perp$. 
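
The two random-oracle tables that the simulator maintains in Phase 1 follow the usual lazy-sampling pattern: answer from the table if the query was seen before, otherwise sample a fresh value and record it. The sketch below is only an illustration of that bookkeeping. The modulus `P` and the length `K` are toy stand-ins (assumptions) for the group order $p$ and the message length $k$, and the class and method names are hypothetical.

```python
import secrets

P = 2**61 - 1   # toy prime standing in for the group order p (assumption)
K = 16          # toy length in bytes standing in for k (assumption)

class LazyOracles:
    """Lazily sampled random oracles, answered from tables T1 and T2."""

    def __init__(self):
        self.t1 = {}   # (R, M) -> s
        self.t2 = {}   # R -> r

    def h1(self, R, M):
        # Return the recorded s if (R, M) was queried before, else sample it.
        key = (R, M)
        if key not in self.t1:
            self.t1[key] = secrets.randbelow(P)
        return self.t1[key]

    def h2(self, R):
        # Return the recorded r if R was queried before, else sample it.
        if R not in self.t2:
            self.t2[R] = secrets.token_bytes(K)
        return self.t2[R]

    def records_for(self, R):
        """All (M, s) pairs recorded in T1 for a given R, as the Decrypt oracle needs."""
        return [(m, s) for (r, m), s in self.t1.items() if r == R]
```

Because every answer is recorded, the simulator can later search the tables for candidate $(R, \mathcal{M}, s)$ triples, which is what the Decrypt oracle and the final Guess step rely on.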


Challenge. Eventually, $\mathcal{A}$ submits a message pair 
$(\mathcal{M}_0^*, \mathcal{M}_1^*) \in \{0,1\}^{2 \times k}$. $\mathcal{B}$ acts as follows: 


1. $\mathcal{B}$ chooses random “messages” $(R_0, R_1) \in \mathbb{G}_T^2$ and 
passes them on to the Waters challenger to obtain a 
ciphertext $CT = (C, C', \{C_i\}_{i \in [1,\ell]})$ under $(M^*, \rho^*)$. 

2. $\mathcal{B}$ chooses a random value $C'' \in \{0,1\}^k$. 

3. $\mathcal{B}$ sends to $\mathcal{A}$ the challenge ciphertext 
$CT^* = (C, C', C'', \{C_i\}_{i \in [1,\ell]})$. 


Phase 2. The simulator $\mathcal{B}$ continues to answer queries 
as in Phase 1, except that if the response to a Decrypt 
query would be either $\mathcal{M}_0^*$ or $\mathcal{M}_1^*$, then $\mathcal{B}$ responds with 
the message test instead. 


Guess. Eventually, $\mathcal{A}$ must either output a bit or abort; 
either way $\mathcal{B}$ ignores it. Next, $\mathcal{B}$ searches through tables 
$T_1$ and $T_2$ to see if the values $R_0$ or $R_1$ appear as the 
first element of any entry (i.e., that $\mathcal{A}$ issued a query of 
the form $H_1(R_i, \cdot)$ or $H_2(R_i)$). If neither or both values 
appear, $\mathcal{B}$ outputs a random bit as its guess. If only value 
$R_b$ appears, then $\mathcal{B}$ outputs $b$ as its guess. 


This ends the description of the simulation. Due to space 
limitations, our analysis of this simulation appears in the 
full version of this work [26]. 
Abstract 


Secure two-party computation enables two parties to 
evaluate a function cooperatively without revealing to ei- 
ther party anything beyond the function’s output. The 
garbled-circuit technique, a generic approach to secure 
two-party computation for semi-honest participants, was 
developed by Yao in the 1980s, but has been viewed 
as being of limited practical significance due to its in- 
efficiency. We demonstrate several techniques for im- 
proving the running time and memory requirements of 
the garbled-circuit technique, resulting in an implemen- 
tation of generic secure two-party computation that is 
significantly faster than any previously reported while 
also scaling to arbitrarily large circuits. We validate our 
approach by demonstrating secure computation of circuits 
with over $10^9$ gates at a rate of roughly 10 µs per 
garbled gate, and showing order-of-magnitude improve- 
ments over the best previous privacy-preserving proto- 
cols for computing Hamming distance, Levenshtein dis- 
tance, Smith-Waterman genome alignment, and AES. 


1 Introduction 


Secure two-party computation enables two parties to 
evaluate an arbitrary function of both of their inputs with- 
out revealing anything to either party beyond the output 
of the function. We focus here on the semi-honest set- 
ting, where parties are assumed to follow the protocol 
but may then attempt to learn information from the pro- 
tocol transcript (see further discussion in Section 1.2). 

There are two main approaches to constructing proto- 
cols for secure computation. The first approach exploits 
specific properties of f to design special-purpose proto- 
cols that are, presumably, more efficient than those that 
would result from generic techniques. A disadvantage of 
this approach is that each function-specific protocol must 
be designed, implemented, and proved secure. 


* Work done while at the University of Maryland. 
The second approach relies on completeness theorems 
for secure computation [7, 8, 34] which give protocols 
for computing any function f starting from a Boolean- 
circuit representation of f. This generic approach to se- 
cure computation has traditionally been viewed as being 
of theoretical interest only since the protocols that result 
require several symmetric-key operations per gate of the 
circuit being executed and the circuit corresponding to 
even a very simple function can be quite large. 

Beginning with Fairplay [22], several implementa- 
tions of generic secure two-party computation have been 
developed in the past few years [11, 21, 27] and used 
to build privacy-preserving protocols for various functions 
(e.g., [4, 13, 16, 26, 29]). Fairplay and its successors 
demonstrated that Yao’s technique could be implemented 
to run in a reasonable amount of time for small circuits, 
but left the impression that generic protocols for secure 
computation could not scale to handle large circuits or in- 
put sizes or compete with special-purpose protocols for 
functions of practical interest. Indeed, some previous 
works have explicitly rejected garbled-circuit solutions 
due to memory exhaustion [16, 26]. 

The thesis of our work is that design decisions made 
by Fairplay, and followed in subsequent work, led re- 
searchers to severely underestimate the applicability of 
generic secure computation. We show that protocols con- 
structed using Yao’s garbled-circuit technique can out- 
perform special-purpose protocols for several functions. 


1.1 Contributions 


We show a general method for implementing privacy- 
preserving applications using garbled circuits that is both 
faster and more scalable than previous approaches. Our 
improvements are of two types: we improve the effi- 
ciency and scalability of garbled circuit execution itself, 
and we provide a flexible framework that allows pro- 
grammers to optimize various aspects of the circuit for 
computing a given function. 
                 Hamming Distance (900 bits)     Levenshtein Distance              AES
                 Online Time     Overall Time    Overall Time†    Overall Time‡    Online Time    Overall Time
Best Previous    0.310 s [26]    213 s [26]      92.4 s           534 s            0.4 s [11]     3.3 s [11]
Our Results      0.019 s         0.051 s         4.1 s            18.4 s           0.008 s        0.2 s
Speedup          16.3            4176            22.5             29               50             16.5

Table 1: Performance comparisons for several privacy-preserving applications. 

† Inputs are 100-character strings over an 8-bit alphabet. The best previous protocol is the circuit-based protocol of [16]. 

‡ Inputs are 200-character strings over an 8-bit alphabet. The best previous protocol is the main protocol of [16]. 





Garbled-circuit execution. In previous garbled-circuit 
implementations including Fairplay, the garbled circuit 
(whose length is several hundred bits per binary gate) 
is fully generated and loaded in memory before circuit 
evaluation starts. This both hurts the efficiency of the 
resulting implementation and severely limits its scalability. 
We observe that it is unnecessary to generate and 
store the entire garbled circuit at once. By topologically 
sorting the gates of the circuit and pipelining the process 
of circuit generation and evaluation we can significantly 
improve overall efficiency and scalability. Our imple- 
mentation never stores the entire garbled circuit, thereby 
allowing it to scale to effectively an unlimited number of 
gates using a nearly constant amount of memory. 


We also employ all known optimizations, includ- 
ing the “free XOR” technique [18], garbled-row reduc- 
tion [27], and oblivious-transfer extension [14]. Sec- 
tion 2 provides cryptographic background and explains 
the protocol and optimizations we use. 


Programming framework. Developing and debugging 
privacy-preserving applications using existing compil- 
ers is tedious, cumbersome, and slow. For example, it 
takes several hours for Fairplay to compile an AES pro- 
gram written in SFDL, even on a computer with 40 GB 
of memory. Moreover, the high-level programming ab- 
straction provided by Fairplay and other tools for secure 
computation obscures important opportunities for gener- 
ating more compact circuits. Although this design de- 
cision stems from the worthy goal of providing a high- 
level programming interface for secure computation, it is 
severely detrimental to performance. In particular, exist- 
ing compilers (1) automatically garble the entire circuit, 
even when portions of the circuit can be computed lo- 
cally without compromising privacy; (2) use more gates 
than necessary, since they always use the maximum num- 
ber of bits needed for a particular variable, even when the 
number of bits needed at some intermediate stage might 
be significantly lower; (3) miss important opportunities 
to replace general gates with XOR gates (which can be 
garbled “for free” [18]); and (4) miss opportunities to use 
special-purpose (e.g., multiple input/output) gates that 
may be more efficient than binary gates. TASTY [11] 
provides a bit more control, by allowing the programmer 




to decide when to use depth-2 arithmetic circuits (which 
can be computed using homomorphic encryption) rather 
than Boolean circuits. However, this is not enough to 
support many important circuit optimizations and there 
are limited places where using homomorphic encryption 
improves performance over an efficient garbled-circuit 
implementation. 

We present a new method and supporting framework 
for generating efficient protocols for secure two-party 
computation. Our method enables programmers to gen- 
erate a secure protocol computing some function f from 
an existing (insecure) implementation of f, while pro- 
viding enough control over the circuit design to enable 
key optimizations to be employed. Our approach al- 
lows users to write their programs using a combination 
of high-level and circuit-level Java code. Programmers 
need to be able to design Boolean circuits, but do not 
need to be cryptographic experts. Our framework en- 
ables circuits to be built and evaluated modularly. Hence, 
even very complex circuits can be generated, evaluated, 
and debugged. This also provides the programmer with 
opportunities to introduce important circuit-level opti- 
mizations. Although we hope that such optimizations 
can eventually be done automatically by sophisticated 
compilers, our emphasis here is on providing a frame- 
work that makes it easy to implement privacy-preserving 
applications. Section 3 provides details about our imple- 
mentation and efficiency improvements. 


Results. We explore applications of our framework 
to several problems considered in prior work including 
secure computation of Hamming distance (Section 4) 
and Levenshtein (edit) distance (Section 5), privacy- 
preserving genome alignment using the Smith-Waterman 
algorithm (Section 6), and secure evaluation of the AES 
block cipher (Section 7). As summarized in Table 1, our 
implementation yields privacy-preserving protocols that 
are an order of magnitude more efficient than prior work, 
in some cases beating even special-purpose protocols de- 
signed (and claimed) to be more efficient than what could 
be obtained using a generic approach.¹ 


1 Results for the Smith-Waterman algorithm are not included in the 
table since there is no prior work for meaningful comparison, as we 
discuss in Section 6. 
1.2 Threat Model 


In this work we adopt the semi-honest (also known as 
honest-but-curious) threat model, where parties are as- 
sumed to follow the protocol but may attempt to learn 
additional information about the other party’s input from 
the protocol transcript. Although this is a very weak 
security model, it is a standard security model for se- 
cure computation, and we refer the reader to Goldreich’s 
text [7] for details. 

Studying protocols in the semi-honest setting is rele- 
vant for two reasons: 


e There may be instances where a semi-honest threat 
model is appropriate: (1) when parties are legit- 
imately trusted but are prevented from divulging 
information for legal reasons, or want to protect 
against future break-ins; or (2) where it would be 
difficult for parties to change the software without 
being detected, either because software attestation 
is used or due to internal controls in place (for ex- 
ample, when parties represent corporations or gov- 
ernment agencies). 


e Protocols for the semi-honest setting are an impor- 
tant first step toward constructing protocols with 
stronger security guarantees. There exist generic 
ways of modifying the garbled-circuit approach to 
give covert security [1] or full security against ma- 
licious adversaries [19, 20, 25, 30]. 


Further, our implementation could be modified easily 
so as to give meaningful privacy guarantees even 
against malicious adversaries. Specifically, consider a 
setting in which only one party $P_2$ (the circuit evaluator; 
see Section 2.1) receives output, and the protocol is 
implemented not to reveal to the other party $P_1$ anything 
about the output (including whether or not the protocol 
completed successfully). If an oblivious-transfer protocol 
with security against malicious adversaries is used 
(see Section 2.2), our implementation achieves full 
security against a malicious $P_2$ and privacy against a 
malicious $P_1$. In particular, neither party learns anything 
about the other party’s inputs beyond what $P_2$ can infer 
about $P_1$’s input from the revealed output. Understanding 
how much private information the output itself leaks 
is an important and challenging problem, but outside the 
scope of this paper. 

Note that this usage of our protocols provides privacy, 
but does not provide any correctness guarantees. A mali- 
cious generator could construct a circuit that produces an 
incorrect result without detection. Hence, this approach 
is insufficient for scenarios where the circuit generator 
may be motivated to trick the evaluator by producing 
an incorrect result. Such scenarios would require fur- 
ther defenses, including mechanisms to prevent parties 
from lying about their inputs. Many interesting privacy-
preserving applications do have the properties needed for 
our approach to be effective. Namely, (1) both parties 
have a motivation to produce the correct result, and (2) 
only one party needs to receive the output. Examples 
include financial fraud detection (banks cooperate to de- 
tect fraudulent accounts), personalized medicine (a pa- 
tient and drug company cooperate to determine the best 
treatment), and privacy-preserving face recognition. 


2 Cryptographic Background 


This section briefly introduces the cryptographic tools 
we use: garbled circuits and oblivious transfer. We adapt 
and implement protocols from the literature, and there- 
fore do not include proofs of security in this work. The 
protocol we implement can be proven secure based on 
the decisional Diffie-Hellman assumption in the random 
oracle model [2]. 


2.1 Garbled Circuits 


Garbled circuits allow two parties holding inputs x and 
y, respectively, to evaluate an arbitrary function f(x,y) 
without leaking any information about their inputs be- 
yond what is implied by the function output. The ba- 
sic idea is that one party (the garbled-circuit genera- 
tor) prepares an “encrypted” version of a circuit com- 
puting f; the second party (the garbled-circuit evalua- 
tor) then obliviously computes the output of the circuit 
without learning any intermediate values. 

Starting with a Boolean circuit for $f$ (which both parties 
fix in advance), the circuit generator associates two 
random cryptographic keys $w_i^0, w_i^1$ with each wire $i$ of the 
circuit ($w_i^0$ encodes a 0-bit and $w_i^1$ encodes a 1-bit). Then, 
for each binary gate $g$ of the circuit with input wires $i, j$ 
and output wire $k$, the generator computes ciphertexts 

$$\mathrm{Enc}^{k}_{w_i^{b_i},\, w_j^{b_j}}\left(w_k^{g(b_i, b_j)}\right)$$

for all inputs $b_i, b_j \in \{0,1\}$. (See Section 3.4 for details 
about the encryption used.) The resulting four cipher- 
texts, in random order, constitute a garbled gate. The 
collection of all garbled gates forms the garbled circuit 
that is sent to the evaluator. In addition, the generator 
reveals the mappings from output-wire keys to bits. 

The evaluator must also obtain the appropriate keys 
(that is, the keys corresponding to each party’s actual 
input) for the input wires. The generator can simply send 
$w_1^{x_1}, \ldots, w_n^{x_n}$, the keys that correspond to its own input, 
where each $w_i^{x_i}$ corresponds to the generator’s $i$th input 
bit. The parties use oblivious transfer (see Section 2.2) 
to enable the evaluator to obliviously obtain the 
input-wire keys corresponding to its own inputs. 
Given keys $w_i, w_j$ associated with both input wires $i, j$ 
of some garbled gate, the evaluator can compute a key for 
the output wire of that gate by decrypting the appropriate 
ciphertext. As described, this requires up to four decryp- 
tions per garbled gate, only one of which will succeed. 
Using standard techniques [22], the construction can be 
modified so a single decryption suffices. Thus, given one 
key for each input wire of the circuit, the evaluator can 
compute a key for each output wire of the circuit. Given 
the mappings from output-wire keys to bits (provided by 
the generator), this allows the evaluator to compute the 
actual output of f. If desired, the evaluator can then send 
this output back to the circuit generator (as noted in Sec- 
tion 1.2, sending the output back to the generator is a pri- 
vacy risk unless the semi-honest model can be imposed 
through some other mechanism). 
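
To make the garbling and evaluation of a single gate concrete, here is a toy sketch. It is not the construction used in the implementation: the encryption below (a SHA-256-derived pad XORed with the output key plus an all-zero tag) is an assumption chosen only so that the evaluator can recognise which of the four trial decryptions succeeded, and all function names are hypothetical.

```python
import hashlib
import secrets

KEY_LEN = 16           # wire-key length in bytes (toy parameter)
TAG = bytes(KEY_LEN)   # all-zero tag used to recognise a successful decryption

def _pad(wi, wj, gate_id):
    # One-time pad derived from both input keys and the gate index (32 bytes).
    return hashlib.sha256(wi + wj + gate_id.to_bytes(4, "big")).digest()

def _xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def new_wire():
    """Return the pair (key encoding 0, key encoding 1) for one wire."""
    return (secrets.token_bytes(KEY_LEN), secrets.token_bytes(KEY_LEN))

def garble_gate(gate, wi, wj, wk, gate_id):
    """Encrypt the right output key under each of the four input-key pairs."""
    rows = []
    for bi in (0, 1):
        for bj in (0, 1):
            plaintext = wk[gate(bi, bj)] + TAG
            rows.append(_xor(plaintext, _pad(wi[bi], wj[bj], gate_id)))
    secrets.SystemRandom().shuffle(rows)   # random order hides the row meaning
    return rows

def evaluate_gate(rows, key_i, key_j, gate_id):
    """Trial-decrypt each row; only the matching one yields a valid tag."""
    pad = _pad(key_i, key_j, gate_id)
    for row in rows:
        plaintext = _xor(row, pad)
        if plaintext[KEY_LEN:] == TAG:
            return plaintext[:KEY_LEN]
    raise ValueError("no row decrypted correctly")
```

For an AND gate, `evaluate_gate(rows, wi[1], wj[1], gate_id)` returns `wk[1]`, while any other pair of input keys returns `wk[0]`; the evaluator learns one output key and nothing about which bit it encodes.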


Optimizations. Several optimizations can be applied 
to the standard garbled circuits protocol, all of which 
we use in our implementation. Kolesnikov and Schneider 
[18] introduced a technique that eliminates the need 
to garble XOR gates (so XOR gates become “free”, incurring 
no communication or cryptographic operations). 
Pinkas et al. [27] proposed a technique to reduce the size 
of a garbled table from four to three ciphertexts, thus saving 
25% of network bandwidth.² 


2.2 Oblivious Transfer 


One-out-of-two oblivious transfer ($\mathrm{OT}^2_1$) [5, 28] is a crucial 
component of the garbled-circuit approach. An $\mathrm{OT}^2_1$ 
protocol allows a sender, holding strings $w^0, w^1$, to transfer 
to a receiver, holding a selection bit $b$, exactly one of 
the inputs $w^b$; the receiver learns nothing about $w^{1-b}$, 
and the sender does not learn $b$. Oblivious transfer 
has been studied extensively, and several protocols are 
known. In our implementation we use the Naor-Pinkas 
protocol [24], secure in the semi-honest setting. We also 
use oblivious-transfer extension [14] which can achieve 
a virtually unlimited number of oblivious transfers at the 
cost of (essentially) $k$ executions of $\mathrm{OT}^2_1$ (where $k$ is a 
statistical security parameter) plus a marginal cost of a few 
symmetric-key operations per additional OT. In our 
implementation, the time for computing the “base” $k = 80$ 
oblivious transfers is about 0.6 seconds, while the on-line 
time for each additional $\mathrm{OT}^2_1$ is roughly 15 µs. 
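
As a rough worked example of these figures, the total oblivious-transfer time can be estimated as the fixed base cost plus a per-transfer cost. The function below is only back-of-the-envelope arithmetic using the two numbers quoted above; its name is hypothetical.

```python
BASE_OT_SECONDS = 0.6     # time for the k = 80 base transfers (from the text)
PER_OT_SECONDS = 15e-6    # on-line time per additional transfer (from the text)

def ot_time_seconds(num_transfers):
    """Estimated total time for num_transfers extended oblivious transfers."""
    return BASE_OT_SECONDS + num_transfers * PER_OT_SECONDS
```

With one transfer per evaluator input bit, a 900-bit input gives 0.6 + 900 × 0.000015 = 0.6135 seconds, so the base transfers dominate unless the evaluator's input is very large.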

For completeness, we note that there are known 
oblivious-transfer protocols with stronger security prop- 
erties [10], as well as techniques for oblivious-transfer 
extension that are secure against malicious adver- 
saries [9]. These could easily be integrated with our im- 
plementation to provide the stronger privacy properties 


2 A second proposed optimization reduces the size by approximately 
50%, but cannot be combined with the free-XOR technique. 
for situations where the result does not go back to the 
circuit generator as discussed in Section 1.2. 


3 Implementation Overview 


Our implementation allows programmers to construct 
protocols in a high-level language while providing 
enough control over the circuit design to enable efficient 
implementations. The source code for the system and all 
the applications described in this paper are available un- 
der an open-source license from http://MightBeEvil.org. 
Our code base is very small: the main framework is 
about 1500 lines of Java code, and a circuit library (see 
Section 3.3) contains an additional 700 lines of code. 
The main features of our framework that enable efficient 
protocols are its support for pipelined circuit execution 
(Section 3.1) and the optimizations enabled by its circuit- 
level representation that allow developers to minimize 
the number of garbled gates needed (Section 3.2). Section 3.3 
describes our circuit library and how a programmer 
defines a new circuit component. Section 3.4 describes 
implementation parameters used in our experiments. 


3.1 Pipelined Circuit Execution 


The primary limitation of previous garbled-circuit implementations 
is the memory required to store the entire circuit. 
There is no need, however, for either 
the circuit generator or evaluator to ever hold the entire 
circuit in memory. The circuit generation and evaluation 
processes can be overlapped in time (pipelined), elim- 
inating the need to ever store the entire garbled circuit 
in memory as well as the need for the circuit generator 
to delay transmission until the entire garbled circuit is 
ready. In our framework, the processing of the garbled 
gates is pipelined to avoid the need to store the entire cir- 
cuit and to improve the running time. This is automated 
by our framework, so a user only needs to construct the 
desired circuit. 

At the beginning of the evaluation both the circuit 
generator and the circuit evaluator instantiate the cir- 
cuit structure, which is known to both of them and is 
fairly small since it can reuse components just like a non- 
garbled circuit. When the protocol is executed, the gener- 
ator transmits garbled gates over the network as they are 
produced, in an order defined by the circuit structure. As 
the evaluator receives the garbled gates, it associates them 
with the corresponding gate of the circuit. Note that the 
order of generating and evaluating the circuit does not 
depend on the parties’ inputs (indeed, it cannot since that 
would leak information about those inputs), so there is no 
overhead required to keep the two parties synchronized. 

The evaluator then determines which gate to evaluate 
next based on the available output values and tables. Gate 




evaluation is triggered automatically when all the neces- 
sary inputs are ready. Once a gate has been evaluated it 
is immediately discarded, so the number of truth tables 
stored in memory is minimal. Evaluating larger circuits 
does not significantly increase the memory load on the 
generator or evaluator, but only affects the network band- 
width needed to transmit the garbled tables. 
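
The pipelining idea can be sketched with a Python generator: the producer emits one gate table at a time in a fixed circuit order, and the consumer uses each table as it arrives and then drops it. This is an illustration only. Plain truth tables stand in for garbled tables, the `Gate` tuple and function names are hypothetical, and the circuit is assumed to be listed in topological order.

```python
from collections import namedtuple

# A gate: output wire id, two input wire ids, and a 2-input Boolean function.
Gate = namedtuple("Gate", "out in1 in2 fn")

def generate(circuit):
    """Generator side: emit one gate table at a time, in circuit order."""
    for gate in circuit:
        table = {(a, b): gate.fn(a, b) for a in (0, 1) for b in (0, 1)}
        yield gate.out, gate.in1, gate.in2, table   # "sent", then forgotten

def evaluate(stream, inputs):
    """Evaluator side: consume each table as it arrives, then discard it."""
    wires = dict(inputs)
    for out, in1, in2, table in stream:
        wires[out] = table[(wires[in1], wires[in2])]
    return wires
```

For example, with `circuit = [Gate("t", "a", "b", lambda x, y: x & y), Gate("o", "t", "c", lambda x, y: x ^ y)]`, calling `evaluate(generate(circuit), {"a": 1, "b": 1, "c": 0})` computes wire `"o"` while only one table exists at any moment, regardless of circuit size.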


3.2 Generating Compact Circuits 


To build an efficient two-party secure computation pro- 
tocol, a programmer first analyzes the target applica- 
tion to identify the components that need to be com- 
puted privately. Then, those components are translated 
to digital circuit designs, which are realized as Java 
classes. Finally, with support from our framework’s core 
libraries, the circuits are compiled and packaged into 
server-side and client-side programs that jointly instan- 
tiate the garbled-circuit protocol. 


The cost of evaluating a garbled circuit protocol scales 
linearly in the number of garbled gates. The efficiency 
of our approach is due to the pipelined circuit execu- 
tion technique described above, as well as several meth- 
ods we use to minimize the number of non-XOR gates 
that need to be evaluated. One way to reduce the num- 
ber of gates is to identify parts of the computation that 
only require private inputs from one party. These com- 
ponents can be computed locally by that party so do not 
require any garbled circuits. By designing circuits at the 
circuit level rather than using a high-level language like 
SFDL [22], we are able to take advantage of these op- 
portunities (for example, by computing the key schedule 
for AES locally; see Section 7). For the parts of the com- 
putation that need to be done cooperatively, we exploit 
several opportunities enabled by our approach to reduce 
the number of non-XOR gates needed. 


Minimizing bit width. To improve performance, our 
circuits are constructed with the minimal width required 
for the correctness of the programs. Our framework sup- 
ports this by allowing most library circuits to be instan- 
tiated with a parameter that specifies the sizes of the in- 
puts, a flexibility that was not present in prior implemen- 
tations of secure computation. For example, SFDL’s sim- 
plicity encourages programmers to count the number of 
1s in a 900-bit number by writing code that leads to a 
circuit using 10-bit accumulators throughout the compu- 
tation even though narrower accumulators are sufficient 
for early stages. The Hamming distance, Levenshtein 
distance, and Smith-Waterman applications described in 
this paper all reduce width whenever possible. This has
a significant impact on the overall efficiency: for example,
it reduces the number of garbled gates needed for our
Levenshtein-distance protocol by 20% (see Section 5.2).

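The effect of accumulator width can be illustrated with a rough count. The sketch below is not part of our framework: the class and method names are hypothetical, and it assumes that one step of a w-bit accumulator costs w non-free gates (the cost stated for AddOneBit in Section 5.2). It compares a serial 1s-counter that uses a fixed-width accumulator with one whose accumulator is only as wide as the running total requires.

```java
final class BitWidthSketch {
    /** Number of bits needed to represent every value in 0..maxValue. */
    static int widthFor(int maxValue) {
        return 32 - Integer.numberOfLeadingZeros(maxValue);
    }

    /** Cost of a serial counter over n input bits with a fixed-width accumulator. */
    static int fixedWidthCost(int n) {
        return n * widthFor(n);
    }

    /** Same counter, but step i only uses the width a total of at most i can need. */
    static int minimalWidthCost(int n) {
        int total = 0;
        for (int i = 1; i <= n; i++) {
            total += widthFor(i);
        }
        return total;
    }
}
```

For a 900-bit input, `widthFor(900)` gives the 10-bit accumulator mentioned above, and the minimal-width variant needs fewer accumulator bits overall.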

Fast table lookups. Constant-size lookup tables are fre- 
quently used in real-world applications (e.g., the score 
matrix for Smith-Waterman and the SBox for AES). 
Such lookup tables can be efficiently implemented as a
single generalized m-to-n garbled gate, where m is the
number of bits needed to represent the index and n is the
number of bits needed to represent each table entry. This,
in turn, can be implemented within a garbled circuit
using a generalization of the standard “permute-and-encrypt”
technique [22]. The advantage of this technique
is that the circuit evaluator only needs to perform a single 
decryption operation to look up an entry in an arbitrarily 
large table. On the other hand, the circuit generator still 
needs to produce and transmit the entire table, so the cost 
for the circuit generator and the bandwidth are high. If 
the table entries have any structure there may be more 
efficient alternatives (see Section 7 for an example). 


3.3 Circuit Library 


Our framework includes a library of circuits defined for 
efficient garbled execution. Applications can be built by 
composing these circuits, but more efficient implementa- 
tions are usually possible when programmers define their 
own custom-designed circuits. 

The hierarchy of circuits is organized following the 
Composite design pattern [6] with respect to the build() 
method. Circuits are constructed in a modular fashion, 
using Wire objects to connect them together. Appendix A 
provides a UML class diagram of the core classes of our 
framework. The Wire and Circuit classes follow a varia- 
tion of the Observer pattern, which offers a kind of pub- 
lish/subscribe functionality [6]. The main difference is 
that when a wire w is connected to a circuit on port p 
(represented as a position index to the inputWires array 
of the circuit), all the observers of the port p automati- 
cally become observers of w. 

The SimpleCircuit abstract class provides a library of 
commonly used functions starting with 2-to-1 AND, OR, 
and XOR gates, where the AND and OR gates are im- 
plemented using Yao’s garbled-circuit technique and the 
XOR gate is implemented using the free-XOR optimiza- 
tion. Implementing a NOT gate is also free since it can 
be implemented as an XOR with constant 1. 

The circuit library also provides more complex circuits 
for, e.g., adders, muxers, comparators, min, max, etc., 
where these circuits were designed to minimize the num- 
ber of non-XOR gates using the techniques described in 
Section 3.2. Optimized circuits for additional functions 
can be added, as needed. A circuit for some desired func- 
tion f can be constructed from the components provided 
in our circuit library, without needing to build the circuit
entirely from AND/OR/NOT gates.

Composite circuits are constructed using the build()
method, with the general structure shown below:


public void build() throws Exception {
    createInputWires();
    createSubCircuits();
    connectWires();
    defineOutputWires();
    fixInternalWires();
}
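The fixed ordering of build() is an instance of the template-method idea. The following self-contained sketch only mimics that ordering; the classes `CompositeCircuitSketch` and `XorThenCountSketch`, and the `trace` list, are hypothetical stand-ins for illustration and do not reproduce the framework's actual classes.

```java
import java.util.ArrayList;
import java.util.List;

abstract class CompositeCircuitSketch {
    /** Records which construction steps ran, in order (illustration only). */
    final List<String> trace = new ArrayList<>();

    /** Fixed construction order, mirroring the build() skeleton. */
    final void build() {
        createInputWires();
        createSubCircuits();
        connectWires();
        defineOutputWires();
        fixInternalWires();
    }

    void createInputWires() { trace.add("inputs"); }
    abstract void createSubCircuits();
    abstract void connectWires();
    abstract void defineOutputWires();
    /** Default: no internal wire is fixed to a known value. */
    void fixInternalWires() { }
}

/** A subclass overrides only the three steps that are typically needed. */
final class XorThenCountSketch extends CompositeCircuitSketch {
    @Override void createSubCircuits() { trace.add("subcircuits"); }
    @Override void connectWires() { trace.add("connect"); }
    @Override void defineOutputWires() { trace.add("outputs"); }
}
```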


To define a new circuit, a user creates a new subclass 
of CompositeCircuit. Typically it is only necessary to 
override the createSubCircuits(), connectWires(), and de- 
fineOutputWires() methods. If internal wires are fixed to 
known values, these can be set by overriding fixInternal- 
Wires(). Our framework automatically propagates known 
signals which improves the run-time whenever any inter- 
nal wires are fixed in this way. For example, given a cir- 
cuit designed to compute the Hamming distance of two 
1024-bit vectors, we can immediately obtain a circuit 
computing the Hamming distance of two 512-bit vectors 
by fixing 512 of each party’s input wires to 0. Because of 
the way we do value propagation, this does not incur any 
evaluation cost. As another example, when running the 
Smith-Waterman algorithm (see Section 6) certain values 
are fixed to public constants and these can be fixed in our 
circuit implementing the algorithm in the same way. 
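The idea behind propagating known signals can be sketched as follows. This is a simplified illustration, not the framework's implementation: a wire value is modelled as a `Boolean` that is `null` when it depends on private input, and all names are hypothetical. In the Hamming example, fixing the same 512 positions of both inputs to 0 makes the corresponding XOR outputs known, so they contribute no work to the counter.

```java
final class ConstantPropagationSketch {
    /** AND with propagation; null stands for a value unknown at build time. */
    static Boolean and(Boolean a, Boolean b) {
        if (Boolean.FALSE.equals(a) || Boolean.FALSE.equals(b)) return Boolean.FALSE;
        if (Boolean.TRUE.equals(a)) return b;
        if (Boolean.TRUE.equals(b)) return a;
        return null;
    }

    /** XOR is only known when both inputs are known. */
    static Boolean xor(Boolean a, Boolean b) {
        if (a == null || b == null) return null;
        return a ^ b;
    }

    /** Counts positions whose XOR output still depends on private input. */
    static int unknownOutputs(Boolean[] v, Boolean[] w) {
        int live = 0;
        for (int i = 0; i < v.length; i++) {
            if (xor(v[i], w[i]) == null) live++;
        }
        return live;
    }
}
```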


3.4 Implementation Details 


Throughout this paper, we use 80-bit wire labels for gar- 
bled circuits and statistical security parameter k = 80 
for oblivious-transfer extension. For the Naor-Pinkas 
oblivious-transfer protocol, we use an order-q subgroup
of $\mathbb{Z}_p^*$ with |q| = 128 and |p| = 1024. These settings
correspond roughly to the ultra-short security level as used
in TASTY [11]. We used SHA-1 to generate the garbled 
truth-table entries. Each entry is computed as: 


$$\mathrm{Enc}^{k}_{w_i^{b_i},\, w_j^{b_j}}\left(w_o^{b_o}\right) = \text{SHA-1}\left(w_i^{b_i} \,\|\, w_j^{b_j} \,\|\, k\right) \oplus w_o^{b_o}$$


All cryptographic primitives were used as provided by 
the Java Cryptography Extension (JCE). Our experi- 
ments were performed on two Dell boxes (Intel Core Duo 
E8400 3GHz) connected on a local-area network. 
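A hash-then-XOR table entry of this kind can be sketched with the standard `java.security.MessageDigest` API. The sketch makes assumptions that the text does not state: the 160-bit SHA-1 output is truncated to the 80-bit label length, and the gate index k is encoded as a 4-byte big-endian integer. The class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

final class GarbledEntrySketch {
    static final int LABEL_BYTES = 10; // 80-bit wire labels

    /** SHA-1(wi || wj || k), truncated to the label length (assumed truncation). */
    static byte[] pad(byte[] wi, byte[] wj, int k) {
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            sha1.update(wi);
            sha1.update(wj);
            sha1.update(ByteBuffer.allocate(4).putInt(k).array());
            return Arrays.copyOf(sha1.digest(), LABEL_BYTES);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-1 unavailable", e);
        }
    }

    /** XORs the pad onto a label; applying it twice recovers the label. */
    static byte[] mask(byte[] wi, byte[] wj, int k, byte[] label) {
        byte[] p = pad(wi, wj, k);
        byte[] out = new byte[LABEL_BYTES];
        for (int i = 0; i < LABEL_BYTES; i++) {
            out[i] = (byte) (p[i] ^ label[i]);
        }
        return out;
    }
}
```

Because the operation is an XOR with a hash-derived pad, the evaluator who holds both input labels recomputes the same pad and unmasks the output label with one hash call.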


4 Hamming Distance 


The Hamming distance Hamming(a,b) between two ℓ-bit
strings $a = a_{\ell-1} \cdots a_1 a_0$ and $b = b_{\ell-1} \cdots b_1 b_0$
is simply the number of positions i where $b_i \neq a_i$. Here we
consider secure computation of Hamming(a,b) where
one party holds a and the other has input b. Secure
Hamming-distance computation has been used as a subroutine
in several privacy-preserving protocols [15, 26].
As part of their SCiFI work, Osadchy et al. [26] show
a protocol based on homomorphic encryption for secure
computation of Hamming distance. To reduce the online
cost of the computation, SCiFI uses pre-computation
techniques aggressively. They report that for ℓ = 900
their protocol has an “off-line” running time of 213s and
an “on-line” running time of 0.31s. (Note that their measure
of “off-line running time” includes the time for any
processing done locally by one party before sending a
message to the other party, even when the local processing
depends on that party’s input.)

Figure 1: Circuit computing Hamming distance.


4.1 Circuit-Based Approach 


We explore a garbled-circuit approach to secure Ham- 
ming-distance computation. The high level design of a 
circuit Hamming for computing the Hamming distance is 
given in Figure 1. The circuit first computes the XOR 
of the two ℓ-bit input strings v, v′, and then uses a sub-circuit
Counter to count the number of 1s in the result.
The output is a k-bit value, where k = ⌈log ℓ⌉.

A naive design of the Counter submodule is to use ℓ
copies of a k-bit AddOneBit circuit, so that in each of
the ℓ iterations the Counter circuit accumulates one bit
of v ⊕ v′ in the k-bit counter.

Since XOR gates are free and a k-bit Adder needs
only k non-XOR gates [17], the Hamming circuit with
the naive Counter needs ℓ · ⌈log ℓ⌉ non-free gates. We
improve upon this by changing the Counter design so as 
to reduce the number of gates while enabling the gates to 
be evaluated in parallel. 

First, we observe that the widths of the early one-bit 
adders can be far smaller than k bits. At the first level, 
the inputs are single bits, so a 1-bit adder with carry is 
sufficient; at the next level, the inputs are 2-bits, so a 2-bit 
adder is sufficient. This follows throughout the circuit, 
halving the total number of gates to ℓ⌈log ℓ⌉/2.

Second, the serialized execution order is unnecessary. 
We improved the naive design to yield a parallel ver- 
sion of Counter given in Figure 2. Our current execution 
framework does not support parallel execution, but is de- 
signed so that this can be readily supported in a future 
version.

Figure 2: Parallelized Counter circuit.

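The two stages of the Hamming circuit, an XOR layer followed by a counter that adds partial sums pairwise, can be mirrored in ordinary (non-private) Java. This is only a plaintext sketch of the structure, with hypothetical names; it does not garble anything and does not model adder widths.

```java
final class HammingSketch {
    /** XOR stage: bitwise difference of the inputs (free gates in a garbled circuit). */
    static boolean[] xor(boolean[] v, boolean[] w) {
        if (v.length != w.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        boolean[] out = new boolean[v.length];
        for (int i = 0; i < v.length; i++) {
            out[i] = v[i] ^ w[i];
        }
        return out;
    }

    /** Counter stage as a pairwise tree: adjacent partial sums are added level by level. */
    static int treeCount(boolean[] bits) {
        int n = bits.length;
        if (n == 0) return 0;
        int[] sums = new int[n];
        for (int i = 0; i < n; i++) {
            sums[i] = bits[i] ? 1 : 0;
        }
        while (n > 1) {
            int pairs = n / 2;
            for (int i = 0; i < pairs; i++) {
                sums[i] = sums[2 * i] + sums[2 * i + 1];
            }
            if (n % 2 == 1) {
                sums[pairs] = sums[n - 1]; // an unpaired element moves up unchanged
            }
            n = pairs + n % 2;
        }
        return sums[0];
    }

    static int hamming(boolean[] a, boolean[] b) {
        return treeCount(xor(a, b));
    }
}
```

The additions within one level of the tree are independent of each other, which is what makes a parallel evaluation order possible.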

4.2 Results 


We implemented a secure protocol for Hamming- 
distance computation using the circuit from the previous 
section and the Java framework described in Section 3. 
Computing the Hamming distance between two 900-bit 
vectors took 0.019 seconds and used 56 KB bandwidth 
in the online phase (including garbled circuit generation 
and evaluation), with 0.051 seconds (of which the OT 
takes 0.018 seconds) spent on off-line preprocessing (including
garbled circuit setup and the OT extension protocol
setup and execution). For the same problem, the protocol
used in SCiFI took 0.31 seconds for on-line computation,
even at the cost of 213 seconds spent on preprocessing.³
The SCiFI paper did not report bandwidth
consumption, but we conservatively estimate that their 
protocol would require at least 110 KB. In addition to 
the dramatic improvement in performance, our approach 
is quite scalable. Figure 3 shows how the running time 
of our protocol scales with increasing input lengths. 

The garbled-circuit implementation has another advantage
as compared to the homomorphic-encryption
approach taken by SCiFI: if the obliviously calculated
Hamming distances are not the final result, but are only
intermediate results that are used as inputs to another
computation, then a garbled-circuit protocol is much better
in that by its nature it can be readily composed with
any subsequent secure computation. In contrast, this
is very inconvenient for homomorphic-encryption-based
protocols because arbitrary operations over the encryptions
are not possible. As an example, in the SCiFI applications
the parties do not want to reveal the computed
Hamming distance h directly but instead only want to
determine if $h > h_{max}$ for some public value $h_{max}$. Osadchy
et al. had to design a special protocol involving
adding random noise to the h values and using an oblivious
transfer protocol to handle this. In our case, however,
we would only need to add a comparator circuit after
the Hamming-distance computation. In fact, with our
approach further optimizations would be possible when
$h_{max}$ is known since at most the $\lceil \log h_{max} \rceil$ low-order bits
of the Hamming distance need to be computed.

³ Osadchy et al. [26] used a 2.8 GHz dual core Pentium D with 2 GB
RAM for their experiments, so the comparison here is reasonably close.
Also note that for their experiments, Osadchy et al. configured their
host to turn off the Nagle ACK delay algorithm, which substantially
improved network performance. This is not realistic for most network
settings and was not done in our experiments.

Figure 3: On-line running time of our Hamming-distance
protocol for different input lengths.


5 Levenshtein Distance 


The Levenshtein distance (also known as edit dis- 
tance) between two strings has applications in DNA and 
protein-sequence alignment, as well as text comparison. 
Given two strings α and β, the Levenshtein distance between
them (denoted Levenshtein(α, β)) is defined as the
minimum number of basic operations (insertion, deletion,
or replacement of a single character) that are needed
to transform α into β. In the setting we are concerned
with here, one party holds α and the other holds β and
the parties wish to compute Levenshtein(α, β).

Algorithm 1 is a standard dynamic-programming algorithm
for computing the Levenshtein distance between
two strings. The invariant is that D[i][j] always represents
the Levenshtein distance between α[1...i] and
β[1...j]. Lines 2–4 initialize each entry in the first column
of the matrix D, while lines 5–7 initialize the first row.
Within the two for-loops (lines 8–13), D[i][j] is
assigned at line 11 to be the smallest of D[i−1][j] + 1,
D[i][j−1] + 1, or D[i−1][j−1] + t (where t is 0 if
α[i] = β[j] and 1 if they are different). These correspond
to the three basic operations insert α[i], delete
β[j], and replace α[i] with β[j].
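For reference, the dynamic program of Algorithm 1 can be written as ordinary, non-private Java. This is a plaintext sketch with hypothetical names; Java strings are 0-indexed, so α[i] in the algorithm corresponds to `a.charAt(i - 1)` here.

```java
final class LevenshteinSketch {
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        // Preset entries: distance from a prefix to the empty string.
        for (int i = 0; i <= a.length(); i++) {
            d[i][0] = i;
        }
        for (int j = 0; j <= b.length(); j++) {
            d[0][j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                // t is 0 when the characters match and 1 otherwise.
                int t = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + t);
            }
        }
        return d[a.length()][b.length()];
    }
}
```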


5.1 State of the Art 


Jha et al. give the best previous implementation of a secure
two-party protocol for computing the Levenshtein
distance [16]. Instead of using Fairplay, they developed
their own compiler based on Fairplay, while borrowing
the function-description language (SFDL) and the
circuit-description language (SHDL) directly from Fairplay.
Jha et al. investigated three different strategies for
securely computing the Levenshtein distance. Their first
protocol (Protocol 1) directly instantiated Algorithm 1
as an SFDL program, which was then compiled into a
garbled-circuit implementation. Because their garbled-circuit
execution approach required keeping the entire
circuit in memory, they concluded that garbled circuits
could not scale to large inputs. The largest problem size
their compiler and execution environment could handle
before crashing was where the parties’ inputs were 200-character
strings over an 8-bit (256-character) alphabet.

Algorithm 1 Levenshtein(α, β)

 1: Initialize D[α.length][β.length];
 2: for i ← 0 to α.length do
 3:     D[i][0] ← i;
 4: end for
 5: for j ← 0 to β.length do
 6:     D[0][j] ← j;
 7: end for
 8: for i ← 1 to α.length do
 9:     for j ← 1 to β.length do
10:         t ← (α[i] = β[j]) ? 0 : 1;
11:         D[i][j] ← min(D[i−1][j] + 1, D[i][j−1] + 1, D[i−1][j−1] + t);
12:     end for
13: end for

Their second protocol combined garbled circuits with 
an approach based on secure computation with shares. 
The resulting protocol was scalable, but extremely slow. 
Finally, they proposed a hybrid protocol (Protocol 3) by 
combining the first two approaches to achieve better per- 
formance with scalability. 

According to their results, it took 92 seconds for Protocol 1
to complete a problem of size 100 × 100 (i.e.,
two strings of length 100) over an 8-bit alphabet. This 
protocol required nearly 2 GB of memory to handle the 
200 x 200 case [16]. Their flagship protocol (Protocol 3), 
which is faster for larger problem sizes, took 658 sec- 
onds and used 364.3 MB bandwidth on a problem of 
size 200 x 200 over an 8-bit alphabet. 


5.2 Circuit-Based Approach 


We observed that the circuit used for secure computation 
of Levenshtein distance can be much smaller than the cir- 
cuit produced from a high-level SFDL description. The 
main reason is that the SFDL description does not dis- 
tinguish parts of the computation that can be performed 
locally by one of the parties, nor does it take advantage 
of the actual number of bits required for values at intermediate
stages of the computation.

The portion of the computation responsible for initial- 
izing the matrix (lines 2—7) does not require any collabo- 
ration, and thus can be completed by each party indepen- 
dently. Moreover, since the length of each party’s private 
string is not meant to be kept secret, the two for-loops 
(lines 8-9) can be managed by each party independently 
as long as they keep the inner executions synchronized, 
leaving only two lines of code (lines 10—11) in the inner- 
most loop that need to be computed securely. 

Let ℓ denote the length of the parties’ input strings, assumed
to be over a σ-bit alphabet. Figure 5a presents a
circuit, LevenshteinCore, that is computationally equivalent
to lines 10–11. The T (stands for “test”) circuit in
that figure outputs 1 if the input strings provided are different.
Figure 4 shows the structure of the T circuit. (For
the purposes of the figures in this section, we assume
σ = 2 since this is the alphabet size that would be used
for genomic comparisons. Nevertheless, everything generalizes
easily to larger σ.) For a σ-bit alphabet, the T
circuit uses σ − 1 non-free gates.

The rest of the circuit computes the minimum of the 
three possible edits (line 11 in Algorithm 1). We be- 
gin with the straightforward implementation shown in 
Figure 5a. The values of D[i−1][j], D[i][j−1], and
D[i−1][j−1] are each represented as ℓ-bit inputs to the
circuit. For now, this is fixed as the maximum value of
any D[i][j] value. Later, we reduce this to the maximum
value possible for a particular core component. Because
of the way we define ℓ there is no need to worry about
the carry output from the adders since ℓ is defined as the
number of bits needed to represent the maximum output
value. The circuit shown calculates exactly the same
function as line 11 of Algorithm 1, producing the out- 
put value of D[i][j]. The full Levenshtein circuit has one 
LevenshteinCore component for each i and j value, con- 
nected to the appropriate inputs and producing the output 
value D[i][j]. The output value of the last Levenshtein- 
Core component is the Levenshtein distance. 

Recall that each ℓ-bit AddOneBit circuit uses ℓ non-free
gates, and each ℓ-bit 2-MIN uses 2ℓ non-free
gates. So, for problems on a σ-bit alphabet, each ℓ-bit
NaiveLevenshteinCore circuit uses 7ℓ + σ − 1 non-free
gates. Next, we present two optimizations that reduce
the number of non-free gates involved in computing the
Levenshtein core to 5ℓ + σ.

Figure 4: T circuit.

Figure 5: Implementations of the Levenshtein core circuit.

Since min(D[i−1][j] + 1, D[i][j−1] + 1) is equivalent
to min(D[i−1][j], D[i][j−1]) + 1, we can combine
the two AddOneBit circuits (at the top left of Figure 5a)
into a single one, and interchange it with the subsequent
2-MIN as shown in Figure 5b. The circuits in
the dashed box in Figure 5b compute min(x+1, y+t),
where t ∈ {0,1}. This is functionally equivalent to:


if (y > x) then x+1 else y+t. 


Hence, we can reuse one of the AddOneBit circuits by 
putting it after the GT logic embedded in the MIN cir- 
cuit. This leads to the optimized circuit design shown 
in Figure 5c. Note that the 1-bit output wire connect- 
ing the 2-MIN and 1-bit MUX circuits is essentially the 
1-bit output of the GT sub-circuit inside 2-MIN. This 
change reduces the number of gates in the core circuit 
to 2 × 2ℓ + ℓ + σ − 1 + 1 = 5ℓ + σ.
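The rewriting of the core can be checked in plaintext. The sketch below (hypothetical names, ordinary integer arithmetic rather than circuits) compares the straightforward minimum of the three edit costs with the restructured form, where x = min(D[i−1][j], D[i][j−1]) and y = D[i−1][j−1], and also records the two gate-count expressions from this section.

```java
final class LevenshteinCoreSketch {
    /** Direct form: minimum over the three candidate edits. */
    static int naiveCore(int up, int left, int diag, int t) {
        return Math.min(Math.min(up + 1, left + 1), diag + t);
    }

    /** Restructured form: one 2-MIN, then "if (y > x) then x+1 else y+t". */
    static int optimizedCore(int up, int left, int diag, int t) {
        int x = Math.min(up, left);
        return (diag > x) ? x + 1 : diag + t;
    }

    /** Exhaustively compares both forms for all inputs in 0..bound and t in {0,1}. */
    static boolean agreeUpTo(int bound) {
        for (int up = 0; up <= bound; up++) {
            for (int left = 0; left <= bound; left++) {
                for (int diag = 0; diag <= bound; diag++) {
                    for (int t = 0; t <= 1; t++) {
                        if (naiveCore(up, left, diag, t) != optimizedCore(up, left, diag, t)) {
                            return false;
                        }
                    }
                }
            }
        }
        return true;
    }

    /** Non-free gates of the naive core: three adders, two 2-MINs, and T. */
    static int naiveGates(int l, int sigma) {
        return 7 * l + sigma - 1;
    }

    /** Non-free gates of the optimized core as counted in the text. */
    static int optimizedGates(int l, int sigma) {
        return 5 * l + sigma;
    }
}
```

The equivalence holds because t is at most 1: when y > x, y + t is at least x + 1, and otherwise y + t is at most x + 1.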

The final optimization takes advantage of the obser- 
vation that the minimal number of bits needed to repre- 
sent D[i][j] varies throughout the computation. For ex- 
ample, one bit suffices to represent D[1][1] while more 
bits are required to represent D[i][j] for larger i’s and 
j’s. The value of D[i][j] can always be represented using 
⌈log min(i, j)⌉ bits. The number of gates decreases by:

$$1 - \frac{\sum_{i=1}^{\ell} \sum_{j=1}^{\ell} \lceil \log \min(i,j) \rceil}{\ell^2 \lceil \log \ell \rceil}$$

For ℓ = 200 this results in a 25% savings, but the effect
decreases as ℓ grows.
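The size of this saving can be checked numerically. The sketch below assumes the fraction saved is one minus the ratio of the summed per-entry widths ⌈log min(i, j)⌉ to the fixed-width total ℓ²⌈log ℓ⌉, with logarithms taken base 2; the class and method names are hypothetical.

```java
final class WidthSavingsSketch {
    /** Ceiling of log2(n) for n >= 1, using exact integer arithmetic. */
    static int ceilLog2(int n) {
        return 32 - Integer.numberOfLeadingZeros(n - 1);
    }

    /** Fraction of width saved by sizing each entry (i, j) individually. */
    static double savings(int l) {
        long used = 0;
        for (int i = 1; i <= l; i++) {
            for (int j = 1; j <= l; j++) {
                used += ceilLog2(Math.min(i, j));
            }
        }
        long full = (long) l * l * ceilLog2(l);
        return 1.0 - (double) used / full;
    }
}
```

Under these assumptions `savings(200)` comes out close to one quarter, in line with the figure quoted above.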

Although it would be possible to describe such a cir- 
cuit using a high-level language like SFDL, it would be 
very tedious and awkward to do so and would require a 
customized program for each input size. Hence, SFDL 
programs tend to allocate more than the number of bits 
needed to ensure correctness of the protocol output.


5.3 Results


We implemented a protocol for secure computation of 
Levenshtein distance using the circuit described above 
and our framework from Section 3. The protocol handles 
arbitrary input lengths ℓ (it also handles the case where
the input strings have different lengths) and arbitrary alphabet
sizes 2^σ. It completes a problem of size 200 × 200
over a 4-character alphabet in 16.38 seconds (of which
less than 1% is due to OT) using 49 MB bandwidth. The
dependence of the running time on σ is small: for σ = 8
our protocol takes 18.4 seconds in the 200 x 200 case, 
which is 29 times faster than the results of Jha et al. [16]. 

Our protocol is highly scalable, as shown in Figure 6. 
The largest problem instance we ran is 2000 x 10000 
(not shown in the figure), which used a total of 1.29 bil- 
lion non-free binary gates and completed in under 223 
minutes (at a rate of over 96,000 gates per second). In 
addition, our approach enables further optimizations for 
many practical scenarios. For example, if the parties are 
only interested in determining whether the Levenshtein 
distance is below some threshold d, then only the ⌈log d⌉
low-order bits of the result need to be computed and the 
number of bits for an entry can be reduced. 


6 Smith-Waterman 


The Smith-Waterman algorithm (Algorithm 2) is a popu- 
lar method for genome and protein alignment [23,31]. In 
contrast to Levenshtein distance which measures dissimi- 
larity, the Smith-Waterman score measures similarity be- 
tween two sequences (higher scores mean the sequences 
are more similar). The algorithm has a basic structure 
similar to the algorithm for computing Levenshtein distance.
The differences are: (1) the preset entries (the first
row and the first column) are initialized to 0; (2) the algorithm
has a more sophisticated core (lines 10–12) that
involves an affine gap function gap and computes the
maximum score across all previous entries in the row and
column; and (3) the algorithm uses a fixed 2-dimensional
score matrix score.

Figure 6: Overall running time of our Levenshtein-distance
protocol. (Plotted on a log-log scale; the problem
size is 200 × DNA Length and σ = 2.)

In practice, the gap function is typically of the form
gap(x) = a + b·x where a, b are publicly known, negative
integer constants. By choosing a and b appropriately,
one can account for the fact that the evolutionary
likelihood of inserting a single large DNA segment is
much greater than the likelihood of multiple insertions
of smaller segments (of the same total length). A typical
gap function is gap(x) = −12 − 7x, which is what we
use in our evaluation experiments.

The 2-dimensional score matrix score quantifies how 
well two symbols from an alphabet match each other. In 
comparing proteins, the symbols represent amino acids 
(one of twenty possible characters including stop sym- 
bols). The entries on the diagonal of the score matrix 
are larger and positive (since each symbol aligns well 
with itself), while all others are smaller and mostly neg- 
ative numbers. The actual numbers vary, and are com- 
puted based on statistical analysis of a genome database. 
We use the BLOSUM62 [12] score matrix for computa- 
tion over randomly generated protein sequences. 

To obtain the optimal alignment, one first computes 
matrix D using Algorithm 2, then finds the entry in D with 
the maximum value and traces the path backwards to find 
how this value was derived. In a privacy-preserving set- 
ting, the full trace may reveal too much information. In- 
stead, it may be used as an intermediate value for a con- 
tinued secure computation, or just aspects of the result 
(e.g., the score or starting position) could be revealed. 


6.1 State of the Art 


The only previous attempt to implement a secure Smith-Waterman
computation is by Jha et al. [16]. (An alternate
approach, suggested by Szajda et al. [32], is to perform
the computation normally but operating on transformed
data instead of the parties’ private data. It is unclear,
however, what privacy or correctness properties can be
achieved by this approach.) Jha et al.’s protocol follows
a similar approach to their Levenshtein-distance protocols
described in Section 5, and led them to conclude
that garbled-circuit implementations could not handle
even small inputs (their garbled-circuit implementation
for Smith-Waterman could not handle a 25 × 25 size input).
Hence, they invented a hybrid protocol (Protocol 3)
to implement the Smith-Waterman algorithm.

Algorithm 2 Smith-Waterman(α, β, gap, score)

 1: Initialize D[α.length][β.length];
 2: for i ← 0 to α.length do
 3:     D[i][0] ← 0;
 4: end for
 5: for j ← 0 to β.length do
 6:     D[0][j] ← 0;
 7: end for
 8: for i ← 1 to α.length do
 9:     for j ← 1 to β.length do
10:         rMax ← max_{1≤o≤i}(D[i−o][j] + gap(o));
11:         cMax ← max_{1≤o≤j}(D[i][j−o] + gap(o));
12:         D[i][j] ← max(0, rMax, cMax, D[i−1][j−1] + score[α[i]][β[j]]);
13:     end for
14: end for

Their prototype had two limitations that prevent direct 
performance comparisons: 


1. They use only 8 bits to represent each entry of the 
dynamic-programming matrix, but for most protein- 
alignment problems the similarity scores between 
even two short sequences of length 25 can overflow 
an 8-bit integer, and for larger sequences it is bound 
to overflow. In the BLOSUM62 scoring table, the 
typical score for two matching proteins is 6 (and as 
high as 11). 


2. They used a constant gap function (gap(x) = —4) 
that is inappropriate for practical scenarios. 


Despite these simplifications in their work, our complete 
Smith-Waterman implementation (that does not make 
any of these simplifications) still runs more than twice 
as fast as their implementation. 


6.2 Circuit-Based Approach 


The core of the Smith-Waterman algorithm (lines 10–12
of Algorithm 2) involves ADD and MAX circuits. To
reduce the number of non-free gates, we replace lines
10–11 with the code in Algorithm 3. This allows us to
use much narrower ADD and MAX circuits for some entries
since we know the width of D[i][j] is bounded by
⌈log(min(i, j) · maxscore)⌉ bits, where maxscore is the greatest
number in the score matrix. We only need to make
sure that values are appropriately sign-extended (a free
operation) when they are carried between circuits of different
width.

Algorithm 3 Restructured Smith-Waterman core

rMax ← 0;
for o ← 1 to i do
    rMax ← max(rMax, D[i−o][j] + gap(o));
end for
cMax ← 0;
for o ← 1 to j do
    cMax ← max(cMax, D[i][j−o] + gap(o));
end for

We also note that gap(o), which serves as the second 
operand to every ADD circuit, can always be safely com- 
puted without collaboration since it does not depend on 
any private input. Thus, instead of computing gap(o) using
a complex garbled circuit, it can be computed directly
with the output value fed directly into the ADD circuit.
Being able to tightly bound the part of the computation 
that really needs to be done privately is another advan- 
tage of our approach. 
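A plaintext version of the restructured core may help make the data flow concrete. The sketch below is ordinary Java, not a secure protocol, and uses hypothetical names. The gap function is the one used in our experiments, gap(x) = −12 − 7x, evaluated in the clear because it depends on no private input. The `score` method is a toy match/mismatch rule standing in for BLOSUM62, an assumption made purely to keep the example short.

```java
final class SmithWatermanSketch {
    /** Public affine gap function; computed locally, outside any circuit. */
    static int gap(int x) {
        return -12 - 7 * x;
    }

    /** Toy score (assumption): +5 for a match, -3 otherwise. Not BLOSUM62. */
    static int score(char a, char b) {
        return a == b ? 5 : -3;
    }

    /** Fills D with the restructured core and returns its largest entry. */
    static int bestScore(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1]; // preset entries are 0
        int best = 0;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int rMax = 0;
                for (int o = 1; o <= i; o++) {
                    rMax = Math.max(rMax, d[i - o][j] + gap(o));
                }
                int cMax = 0;
                for (int o = 1; o <= j; o++) {
                    cMax = Math.max(cMax, d[i][j - o] + gap(o));
                }
                int diag = d[i - 1][j - 1] + score(a.charAt(i - 1), b.charAt(j - 1));
                d[i][j] = Math.max(Math.max(0, rMax), Math.max(cMax, diag));
                best = Math.max(best, d[i][j]);
            }
        }
        return best;
    }
}
```

Starting rMax and cMax at 0 does not change the result, since the final maximum already includes 0.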

The matrix-indexing operation on score does need to 
be done in a privacy-preserving way since its inputs re- 
veal symbols in the private inputs of the parties. Since 
the row index and column index each can be denoted
as a 5-bit number, we could view the score table as
a 10-to-1 garbled circuit (where each entry in the truth
table is an encryption of 5 wire keys representing the
output value). Using an extension of the permute-and-encrypt
technique, this leads to a garbled table containing
2^10 = 1024 ciphertexts (of which 624 are null entries
since the actual table is 20 × 20, but which must
be transmitted as random entries to avoid leaking information).
However, observe that one of the two indexes
is known to the circuit generator since it corresponds to
the generator’s input value at a known location. Hence,
we use the index known to the circuit generator to specialize
the two-dimensional score table lookup to a one-dimensional
table lookup. This reduces the cost of oblivious
table lookup to computing and transmitting 20 ciphertexts
and 12 random entries (to fill the 2^5-entry table)
for the circuit generator, while the work for the circuit
evaluator is still performing one decryption.
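The specialization step can be pictured in plaintext as selecting one row of the score table and padding it to a power-of-two size. The sketch below uses hypothetical names and works on plain integers; in the protocol the 20 real entries would be ciphertexts, and the filler would be random strings of the same length.

```java
import java.util.Random;

final class ScoreRowSketch {
    static final int SYMBOLS = 20;    // size of the amino-acid alphabet
    static final int TABLE_SIZE = 32; // 2^5 entries for a 5-bit index

    /** Builds the one-dimensional table for the row the generator already knows. */
    static int[] specialize(int[][] score, int knownRow, Random rng) {
        int[] table = new int[TABLE_SIZE];
        System.arraycopy(score[knownRow], 0, table, 0, SYMBOLS);
        for (int i = SYMBOLS; i < TABLE_SIZE; i++) {
            table[i] = rng.nextInt(); // filler for indexes that never occur
        }
        return table;
    }

    /** Number of filler entries in a table with indexBits-bit index and `used` real entries. */
    static int fillerEntries(int indexBits, int used) {
        return (1 << indexBits) - used;
    }
}
```

The two filler counts quoted in the text follow directly: a 10-bit index over a 20 × 20 table leaves 624 unused entries, and a 5-bit index over one 20-entry row leaves 12.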


6.3 Results 


Our secure Smith-Waterman protocol takes 415 seconds
and generates 1.17 GB of network traffic running on two
protein sequences of length 60. The garbled-circuit implementation
by Jha et al. did not scale to a 60 × 60 input
size, but their Protocol 3 was able to complete on this
input length in nearly 1000 seconds (but recall that due
to simplifications they used, their implementation would
not usually produce the correct result). Figure 7 shows
the running time of our implementation as a function of
the problem size.

Figure 7: Overall running time of the Smith-Waterman
protocol. (Plotted on a log-log scale; problem size
20 × Codon Sequence Length.)


7 AES 


AES is a standardized block cipher. We focus on AES- 
128 which uses a 128-bit key as well as a 128-bit block 
length. The high-level operation of AES is shown in Listing 1
(based on Daemen and Rijmen’s report [3]). It takes
a 16-byte array msg and a large byte array key, which is 
the output of the AES key schedule. The variable Nr de- 
notes the number of rounds (for AES-128, Nr=10). 

In privacy-preserving AES, one party holds the key k 
and the other holds an input block x. At the end of the 
protocol, the second party learns $\mathrm{AES}_k(x)$. This functionality
has a number of interesting applications including
encrypted keyword search (see Pinkas et al. [27]). 


7.1 Prior Work 


Pinkas et al. [27] implement AES as an SFDL program, 
which is in turn compiled to a huge SHDL circuit con- 
sisting of more than 30,000 gates. Henecka et al. [11] 
used the same circuit, but obtained better online perfor- 
mance by moving more of the computation to the pre- 
computation phase. The best performance results they 
reported are 3.3 seconds in total and 0.4 seconds online 
per block-cipher evaluation. 


7.2 Our Approach 


We also use garbled circuits to implement privacy-preserving
AES. However, our technique is distinguished
from previous ones in that instead of constructing a huge
circuit, we derive our privacy-preserving implementation
around the structure of a traditional program, following
the code in Listing 1. Our guiding principle is to identify
the minimal subset of the computation that needs to
be performed in a privacy-preserving manner, and only
use garbled circuits for that portion of the computation.
Specifically, we observe that the entire key schedule can
be computed locally by the party holding the key. There
is no need to use garbled circuits to compute the key
schedule since it only depends on one party’s data.

public static byte[] Cipher(byte[] key, byte[] msg) {
    byte[] state = AddRoundKey(key, msg, 0);
    for (int round = 1; round < Nr; round++) {
        state = SubBytes(state);
        state = ShiftRows(state);
        state = MixColumns(state);
        state = AddRoundKey(key, state, round);
    }
    state = SubBytes(state);
    state = ShiftRows(state);
    state = AddRoundKey(key, state, Nr);
    return state;
}

Listing 1: The AES block cipher.


Overview. To make the implementation simpler, we ex- 
plicitly group the wire labels of every 8-bit byte into a 
State object, representing the intermediate results of gar- 
bled circuits. Compared to the original code (Listing 1), 
we only need to replace the built-in data type byte with 
our custom type State in building the code for imple- 
menting the garbled circuit. Since the state is repre- 
sented by garbled wire labels, we can compose circuits 
implementing each execution phase to perform the se- 
cure computation. 

First, as noted earlier, the expanded key, which is the
output of the key schedule, can be computed by Alice
alone, and then used as effective input to a circuit. This
enables us to replace the expensive privacy-preserving 
key schedule computation with less expensive oblivious 
transfers (which, due to the oblivious-transfer extension, 
are cheaper than using garbled circuits). 

Second, as in many other real-world AES cipher im- 
plementations, the SubBytes subroutine dominates the 
resource (e.g., time and hardware area) consumption. 
We consider two possible designs for implementing the 
SubBytes subroutine. The first design minimizes online 
time for situations where preprocessing is possible; the 
second minimizes total time in the absence of idle peri- 
ods for preprocessing. 


Third, the ShiftRows subroutine imposes no cost for 
our circuit implementation since this subroutine merely
impacts the wiring but requires no additional gates. 


The MixColumns subroutine requires secure compu- 
tation, but we design a circuit for this that uses only 
XORs. The AddRoundKey subroutine is realized by a Bit- 
WiseXOR circuit that simply juxtaposes 128 XOR gates. 


SubBytes. The SubBytes component dominates the 
time for AES, so we consider two alternate designs. 


Minimizing online time. Our first design seeks to
minimize the online execution time by moving as much of
the work as possible to the preprocessing phase. The
SubBytes subroutine can be implemented with sixteen
8-bit-to-8-bit garbled tables, similar to the score matrix
used in the Smith-Waterman application. From the
perspective of the circuit generator, this results in a garbled
"gate" with $2^8 \times 8 = 2048$ ciphertexts. The circuit
evaluator need only decrypt 8 of these (i.e., one table entry)
at a cost of 4 hash evaluations (since we use 80-bit wire
labels and SHA-1, with 160-bit output length, for the
encryption). This design is distinguished by its very low
online cost, so it is well suited to situations where the
primary goal is to minimize the online execution time.
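The figures in this paragraph can be checked with a few lines of arithmetic. The sketch below uses hypothetical names and rests on one assumption about how the hash count arises: that the eight 80-bit output labels of a table entry (640 bits) are masked with four 160-bit SHA-1 outputs.

```java
// Sketch only: back-of-the-envelope cost of one 8-bit-to-8-bit garbled table.
// Names are hypothetical; the packing of labels into hash outputs is an assumption.
final class GarbledTableCost {
    static final int INPUT_BITS = 8;    // table is indexed by one byte
    static final int OUTPUT_BITS = 8;   // each entry encodes one output byte
    static final int LABEL_BITS = 80;   // wire-label length
    static final int HASH_BITS = 160;   // SHA-1 output length

    // One ciphertext per output wire for each of the 2^8 table rows.
    static int ciphertextsPerTable() {
        return (1 << INPUT_BITS) * OUTPUT_BITS;
    }

    // Hash calls needed to cover the labels of a single table entry.
    static int hashCallsPerLookup() {
        int bitsPerEntry = OUTPUT_BITS * LABEL_BITS;
        return (bitsPerEntry + HASH_BITS - 1) / HASH_BITS;
    }

    public static void main(String[] args) {
        System.out.println("ciphertexts per table: " + ciphertextsPerTable());
        System.out.println("hash calls per lookup: " + hashCallsPerLookup());
    }
}
```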


Minimizing total time. Our second design aims to
minimize the total execution time by implementing
SubBytes with an efficient circuit derived from the work
of Wolkerstorfer et al. [33]. The two logical components
of SubBytes are computing an inverse over $GF(2^8)$ and
an affine transformation over $GF(2)$. The circuit we
use to compute the inverse over $GF(2^8)$ is given in
Figure 8. In essence, $GF(2^8)$ is viewed as an extension
of $GF(2^4)$, so that an element of $GF(2^8)$ is mapped to
a vector of length two over $GF(2^4)$. A series of
operations over $GF(2^4)$ are applied to these values, which
are then mapped back to an element in $GF(2^8)$. In this
circuit diagram, Map and Inverse Map circuits realize
the bijection between $GF(2^8)$ and $(GF(2^4))^2$; $\oplus$ and
$\otimes$ represent addition and multiplication over $GF(2^4)$,
respectively. The affine transform over finite field $GF(2)$
and all of the component circuits except for the $\otimes$ and
$GF(2^4)$Inverse circuits can be implemented using XOR
gates alone. Since each $\otimes$ circuit has 16 non-free gates
and each $GF(2^4)$Inverse has 10 non-free gates, the
total number of non-free gates per $GF(2^8)$Inverse circuit
is $16 \times 3 + 10 = 58$.
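For reference, the two logical components of SubBytes can be written in plain Java as follows. This is only a functional sketch with hypothetical names: it inverts directly in $GF(2^8)$ by exponentiation rather than through the composite-field decomposition of Figure 8, so it shows what the circuit computes, not how the circuit is built. The gate count at the end simply restates the arithmetic from the text.

```java
// Sketch only: functional reference for SubBytes (inverse, then affine map).
// Uses direct exponentiation in GF(2^8), not the composite-field circuit.
final class SubBytesSketch {
    // Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1.
    static int gmul(int a, int b) {
        int p = 0;
        for (int i = 0; i < 8; i++) {
            if ((b & 1) != 0) p ^= a;
            boolean carry = (a & 0x80) != 0;
            a = (a << 1) & 0xff;
            if (carry) a ^= 0x1b;
            b >>= 1;
        }
        return p;
    }

    // Multiplicative inverse computed as a^254; 0 maps to 0, as AES requires.
    static int inverse(int a) {
        int result = 1;
        for (int i = 0; i < 254; i++) result = gmul(result, a);
        return result;
    }

    // Rotate an 8-bit value left by n positions (0 < n < 8).
    static int rotl8(int x, int n) {
        return ((x << n) | (x >>> (8 - n))) & 0xff;
    }

    // Affine transformation over GF(2): only XORs of rotated copies and a constant.
    static int affine(int x) {
        return x ^ rotl8(x, 1) ^ rotl8(x, 2) ^ rotl8(x, 3) ^ rotl8(x, 4) ^ 0x63;
    }

    static int subByte(int b) {
        return affine(inverse(b & 0xff));
    }

    // Non-free gates per inverse circuit: three multipliers plus one GF(2^4) inverse.
    static int nonFreeGatesPerInverse() {
        final int gatesPerMultiplier = 16, multipliers = 3, gatesPerGf16Inverse = 10;
        return gatesPerMultiplier * multipliers + gatesPerGf16Inverse;
    }

    public static void main(String[] args) {
        System.out.printf("S(0x53) = 0x%02x%n", subByte(0x53));
        System.out.println("non-free gates per inverse: " + nonFreeGatesPerInverse());
    }
}
```

Because the affine step consists only of XORs, all of the non-free gates come from the inversion step.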


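Figure 8: Inverse Circuit over $GF(2^8)$.

MixColumns. The core functionality of MixColumns is
to compute $s'_c(x) = a(x) \otimes s_c(x)$, where $0 \le c < 4$
specifies the column,
$a(x) = \{03\}x^3 + \{01\}x^2 + \{01\}x + \{02\}$,
and $\otimes$ denotes multiplication over finite field $GF(2^8)$.
Let $s_c(x) = s_{3,c}x^3 + s_{2,c}x^2 + s_{1,c}x + s_{0,c}$ and
$s'_c(x) = s'_{3,c}x^3 + s'_{2,c}x^2 + s'_{1,c}x + s'_{0,c}$.
This is equivalent to

\begin{align*}
s'_{0,c} &= (\{02\} \cdot s_{0,c}) \oplus (\{03\} \cdot s_{1,c}) \oplus s_{2,c} \oplus s_{3,c} \\
s'_{1,c} &= s_{0,c} \oplus (\{02\} \cdot s_{1,c}) \oplus (\{03\} \cdot s_{2,c}) \oplus s_{3,c} \\
s'_{2,c} &= s_{0,c} \oplus s_{1,c} \oplus (\{02\} \cdot s_{2,c}) \oplus (\{03\} \cdot s_{3,c}) \\
s'_{3,c} &= (\{03\} \cdot s_{0,c}) \oplus s_{1,c} \oplus s_{2,c} \oplus (\{02\} \cdot s_{3,c}).
\end{align*}

It follows that:

\begin{align*}
s'_{0,c} &= (\{02\} \cdot s_{0,c}) \oplus (\{02\} \cdot s_{1,c}) \oplus s_{1,c} \oplus s_{2,c} \oplus s_{3,c} \\
s'_{1,c} &= s_{0,c} \oplus (\{02\} \cdot s_{1,c}) \oplus (\{02\} \cdot s_{2,c}) \oplus s_{2,c} \oplus s_{3,c} \\
s'_{2,c} &= s_{0,c} \oplus s_{1,c} \oplus (\{02\} \cdot s_{2,c}) \oplus (\{02\} \cdot s_{3,c}) \oplus s_{3,c} \\
s'_{3,c} &= (\{02\} \cdot s_{0,c}) \oplus s_{0,c} \oplus s_{1,c} \oplus s_{2,c} \oplus (\{02\} \cdot s_{3,c}).
\end{align*}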








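The operation $\{02\} \cdot b$ is defined as multiplying $b$ by
$\{02\}$ modulo $\{1b\}$ in $GF(2^8)$. If $b = b_7 \cdots b_1 b_0$ and
$z = z_7 \cdots z_1 z_0 = \{02\} \cdot b$, the output bits can be computed
using only XOR gates:

\begin{align*}
z_7 &= b_6, & z_6 &= b_5, & z_5 &= b_4, & z_4 &= b_3 \oplus b_7, \\
z_3 &= b_2 \oplus b_7, & z_2 &= b_1, & z_1 &= b_0 \oplus b_7, & z_0 &= b_7.
\end{align*}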


For every column of 4-byte numbers, the equations 
above are implemented by the MixOneColumn circuit 
(Figure 9). Each invocation of MixColumns involves 
processing four columns, so we can build the Mix- 
Columns circuit by juxtaposing four MixOneColumn cir- 
cuits. Thus, the MixColumns circuit can be implemented 
using only XOR gates. 
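The equations above translate directly into code that uses nothing but bit selection and XOR. The following sketch (hypothetical class and method names, operating on plain bytes rather than garbled wire labels) mirrors the bit-level definition of {02}·b and the four output equations for one column.

```java
// Sketch only: MixColumns on one column using XOR alone, on plain bytes.
// Names are hypothetical; a garbled version would act on wire labels instead.
final class MixOneColumnSketch {
    // {02}·b from the bit equations:
    // z0=b7, z1=b0^b7, z2=b1, z3=b2^b7, z4=b3^b7, z5=b4, z6=b5, z7=b6.
    static int xtime(int b) {
        int[] in = new int[8];
        for (int i = 0; i < 8; i++) in[i] = (b >> i) & 1;
        int[] z = {
            in[7], in[0] ^ in[7], in[1], in[2] ^ in[7],
            in[3] ^ in[7], in[4], in[5], in[6]
        };
        int out = 0;
        for (int i = 0; i < 8; i++) out |= z[i] << i;
        return out;
    }

    // Apply the four output equations, with {03}·s written as ({02}·s) ^ s.
    static int[] mixOneColumn(int[] s) {
        int t0 = xtime(s[0]), t1 = xtime(s[1]), t2 = xtime(s[2]), t3 = xtime(s[3]);
        return new int[] {
            t0 ^ t1 ^ s[1] ^ s[2] ^ s[3],
            s[0] ^ t1 ^ t2 ^ s[2] ^ s[3],
            s[0] ^ s[1] ^ t2 ^ t3 ^ s[3],
            t0 ^ s[0] ^ s[1] ^ s[2] ^ t3
        };
    }

    // MixColumns on a 16-byte state stored column by column.
    static int[] mixColumns(int[] state) {
        int[] out = new int[16];
        for (int c = 0; c < 4; c++) {
            int[] col = mixOneColumn(new int[] {
                state[4 * c], state[4 * c + 1], state[4 * c + 2], state[4 * c + 3]
            });
            System.arraycopy(col, 0, out, 4 * c, 4);
        }
        return out;
    }

    public static void main(String[] args) {
        int[] col = mixOneColumn(new int[] {0xdb, 0x13, 0x53, 0x45});
        for (int v : col) System.out.printf("%02x ", v);
        System.out.println();
    }
}
```

Each {02}·s value is computed once per input byte and reused in two output equations, which keeps the number of XORs per column small.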


7.3 Results 


Using the first (online-minimizing) SubBytes design,
there are no non-free gates and 160 oblivious table
lookups. The total time for the computation is 1.6
seconds without preprocessing. With preprocessing, the
online time to evaluate the circuit is 0.008 seconds (since
the evaluator can always identify the right entry in the
table to decrypt), more than 50 times faster than the best
previous results [11].

Figure 9: MixOneColumn Circuit.


With our second design, the total number of non-free
gates for the entire AES computation is
$58 \times 16 \times 10 = 9280$. The overall time is 0.2 seconds
(of which 0.08 seconds is spent on oblivious transfer) without
preprocessing, more than 16 times faster than the best previous
results [11]. The online time is 0.06 seconds with
preprocessing enabled.
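As a check on the counts for both designs, assume AES-128 (Nr = 10). Following Listing 1, SubBytes is invoked Nr - 1 = 9 times inside the loop and once more afterwards, and each invocation processes 16 bytes:

```latex
\begin{align*}
\text{SubBytes byte substitutions} &= 16 \times \bigl((N_r - 1) + 1\bigr) = 16 \times 10 = 160, \\
\text{table lookups (first design)} &= 160, \\
\text{non-free gates (second design)} &= (16 \times 3 + 10) \times 160 = 58 \times 16 \times 10 = 9280.
\end{align*}
```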


8 Conclusion 


Misconceptions about the performance and scalability of 
garbled circuits are pervasive. This perception has led 
to the development of several complex, special-purpose 
protocols for problems that are better addressed by gar- 
bled circuits. We demonstrate that a simple pipelining 
approach, along with techniques to minimize circuit size, 
is enough to make garbled circuits scale to many large 
problems, and practical enough to be competitive with 
special-purpose protocols. 


We hope improvements in the efficiency of
privacy-preserving computing will enable many sensitive
applications to be deployed. Ours is just a first step towards
that goal, and more work needs to be done before
secure computation can be used routinely in practice.
Although our approach enables circuits to scale arbitrarily
and makes evaluation substantially faster than previous
work, it is still far slower than normal computation.
Further performance improvements are needed before large
problems can be computed securely in interactive
systems. In addition, our work assumes the semi-honest
threat model, which is suitable only for scenarios
where only one party obtains the output or both parties
can rely on verified implementations. Efficient protocols
secure in the malicious adversary model appear to be
much more challenging to design.
A Core Classes


The core classes in our framework are shown in the UML diagram below. 


[UML class diagram; the recoverable contents are listed below.]

<<interface>> TransitiveObservable
<<interface>> TransitiveObserver
Circuit -> TransitiveObserver <<realize>>

Wire
    value : int
    lbl : BigInteger
    invd : boolean
    connectTo(ws : Wire[], idx : int) : void
    fixWire(v : int) : void

Circuit
    inputWires : Wire[]
    outputWires : Wire[]
    build() : void
    startExecuting(s : State) : State
    update(o : TransitiveObservable, arg : Object) : void

CompositeCircuit
    subCircuits : Circuit[]
    build() : void
    createSubCircuits() : void
    connectWires() : void
    defineOutputWires() : void
    fixInternalWires() : void

(class name lost in extraction)
    build() : void
    execute() : void

SimpleCircuit_2_1
    gtt : BigInteger[][]